Data Science Project - Malicious Website Detection Based on URL Characteristics¶
plt.figure(figsize=(12,9))
img = mpimg.imread('malicious.png')
imgplot = plt.imshow(img)
plt.show()
Introduction¶
The internet plays a crucial role in our daily lives, offering a wealth of resources and opportunities, but it also brings challenges, such as the risk of encountering malicious websites.
A URL (Uniform Resource Locator) is essentially the address used to access resources on the internet. It provides information about the location and the method of retrieving web content.
Components of a URL¶
URLs typically consist of several key components:
- Protocol (HTTP or HTTPS): Specifies how the browser should communicate with the server.
- Domain name: Represents the web server or website that hosts the resource.
- Path: Identifies the specific resource or webpage on the server.
Additionally, URLs may include:
- Query parameters: Optional data passed to the server to modify or filter the response (search terms).
- Fragment identifiers: Direct users to a specific section of a webpage.
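These components can be illustrated with Python's standard `urllib.parse` module (the URL below is made up purely for demonstration):

```python
from urllib.parse import urlparse

# Hypothetical URL used only to illustrate the components
url = "https://shop.example.com:8443/products/item?id=42&ref=mail#reviews"
parts = urlparse(url)

print(parts.scheme)    # protocol: https
print(parts.hostname)  # domain name: shop.example.com
print(parts.port)      # explicit port: 8443
print(parts.path)      # path: /products/item
print(parts.query)     # query parameters: id=42&ref=mail
print(parts.fragment)  # fragment identifier: reviews
```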
Malicious URLs¶
Malicious URLs are web addresses created with harmful intent: they are designed to deceive users and carry out various malicious activities. This project distinguishes the following URL categories:
Benign URLs: Legitimate web addresses that pose no threat to users. While they serve their intended purpose without malicious intent, many malicious URLs are designed to mimic benign ones to mislead users into thinking they are visiting a safe site.
Phishing: These URLs are crafted to resemble legitimate websites (such as banks or popular services) to trick users into entering sensitive information like usernames, passwords, or credit card details. Phishing URLs often lead to fraudulent login pages that capture user data.
Defacement: Some malicious URLs lead to websites that have been defaced — where content has been altered or vandalized. Hackers may display unauthorized messages, disrupt services, or damage an organization’s credibility.
Malware: Other malicious URLs are designed to deliver malware, including viruses, ransomware, or spyware. When users click on these URLs, they may unknowingly download harmful software that can compromise their system, steal data, or hold their files hostage.
Although malicious URLs may appear legitimate at first glance, they often conceal attacks that can lead to data breaches, financial loss, or system compromise. As cyber threats continue to evolve, detecting and mitigating these risks is a critical area of focus in cybersecurity.
plt.figure(figsize=(12,9))
img = mpimg.imread('URL1.png')
imgplot = plt.imshow(img)
plt.show()
Motivation¶
With the exponential growth of internet users and web services, malicious URLs have become a primary vector for cyberattacks. These attacks range from phishing schemes that trick users into revealing personal information to malware that silently infects devices. The impact of malicious websites is extensive, including:
- Identity theft
- Data loss
- Financial fraud
- Damage to both individuals and organizations
The rapid growth of the internet has made it an indispensable part of daily life, but it has also become a breeding ground for cyber threats. Malicious websites are a key component of these threats, as cybercriminals continually create harmful URLs to exploit unsuspecting users.
Each year, more than 10 million new malicious websites are created globally, aimed at:
- Delivering malware
- Stealing personal data
- Launching phishing attacks
According to a 2023 report from Symantec, over 1.5 million new phishing sites are created each month. The FBI’s 2022 Internet Crime Report notes that cybercrime cost Americans over $10 billion in a single year, with phishing and malware being major contributors to these losses.
Increasing Sophistication of Malicious Actors¶
Malicious actors are becoming more sophisticated, using advanced techniques to disguise harmful URLs. As a result, traditional blacklist-based detection methods struggle to keep up with the speed and volume of emerging threats.
Manual detection is no longer practical due to the vast number of websites. Therefore, automation and scalability have become crucial in addressing this issue.
The Role of Machine Learning¶
Machine learning offers an effective solution for detecting malicious URLs. By analyzing URL patterns and structures, machine learning models can help improve both the efficiency and accuracy of malicious URL detection.
Given these alarming trends, developing a robust, automated system for detecting malicious websites based on URL characteristics is essential to:
- Enhance cybersecurity measures
- Reduce potential harm
- Stay ahead of evolving cyber threats
Goals¶
The primary goal of this project is to develop an effective system for detecting malicious URLs based on their characteristics. This involves:
- Feature Extraction: Identifying and extracting relevant features from URLs that may indicate malicious intent, such as unusual patterns, length, and the presence of suspicious keywords.
- Machine Learning Models: Implementing machine learning algorithms to classify URLs as benign or malicious based on the extracted features.
- Real-Time Detection: Creating a system capable of analyzing URLs in real-time to provide immediate feedback and protection to users.
- Improving Detection Rates: Aiming to enhance the accuracy and efficiency of existing detection methods to reduce false positives and false negatives.
Challenges and Struggles in Detecting Malicious URLs¶
Detecting malicious URLs is a complex and evolving challenge due to several factors. As attackers continuously refine their techniques, security systems must also adapt. Below are some of the key challenges faced in detecting malicious URLs:
1. Evolving Techniques of Attackers¶
Cybercriminals are constantly developing new ways to disguise their malicious URLs. This includes:
- Obfuscation Techniques: Attackers often obfuscate URLs using URL shortening services, encoding characters, or implementing multiple redirects to hide the actual destination.
- Domain Generation Algorithms (DGA): These algorithms are used to create large numbers of domain names that can be used in attacks, making it difficult for traditional detection methods to keep up.
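To make the DGA idea concrete, here is a toy, date-seeded generator. This is a deliberately simplified sketch, not any real malware's algorithm; the seed string and hashing scheme are invented for illustration:

```python
import hashlib
from datetime import date

def toy_dga(seed: str, day: date, n: int = 3) -> list:
    """Toy date-seeded domain generator; real DGAs are far more varied."""
    domains = []
    for i in range(n):
        # Hash seed + date + counter, keep a short prefix as the domain label
        digest = hashlib.md5(f"{seed}-{day.isoformat()}-{i}".encode()).hexdigest()
        domains.append(digest[:12] + ".com")
    return domains

# Attacker and infected machines can derive the same daily domain list,
# so a static blacklist built yesterday misses today's domains.
print(toy_dga("example-seed", date(2024, 1, 1)))
```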
2. Volume of Data¶
The sheer volume of web traffic makes manual detection impractical. Millions of new URLs are generated daily, requiring automated systems to effectively analyze and identify malicious ones.
3. Sophistication of Malicious URLs¶
Malicious URLs may closely resemble benign ones, making it challenging to differentiate between safe and harmful links. This includes:
- Typosquatting: Attackers create URLs that are similar to legitimate ones but contain slight misspellings, tricking users into visiting harmful sites.
- Phishing Pages: Phishing URLs often use similar branding or design as legitimate sites, making detection difficult.
4. False Positives and Negatives¶
Balancing detection accuracy is crucial:
- False Positives: Legitimate URLs may be flagged as malicious, causing inconvenience for users and potentially damaging trust.
- False Negatives: Malicious URLs that go undetected can lead to significant harm, such as data breaches or financial losses.
- Given the potential consequences, we are particularly vigilant about minimizing false negatives. Allowing a malicious URL to evade detection poses a far greater risk than mistakenly flagging a benign URL. Thus, ensuring accurate detection of malicious URLs is paramount to safeguarding user safety and maintaining trust in our cybersecurity measures.
5. Dynamic and Contextual Nature of URLs¶
The context in which a URL is used can affect its maliciousness. URLs may appear benign in one scenario but could be harmful in another, requiring systems to analyze contextual data effectively.
Problem Statement¶
In this case study, we address the detection of malicious URLs as a multi-class classification problem. Our objective is to classify raw URLs into different categories, including:
- Benign or Safe URLs: Legitimate web addresses that pose no threat to users.
- Phishing URLs: URLs designed to deceive users into providing sensitive information by mimicking legitimate sites.
- Malware URLs: URLs that deliver harmful software to users' devices.
- Defacement URLs: URLs that lead to altered web pages with unauthorized content.
By accurately classifying these URL types, we aim to enhance cybersecurity measures and provide users with better protection against various online threats.
Project Workflow¶
1. Data Preprocessing¶
- Load Data: Import the dataset and inspect its structure.
- Check for Null and Duplicate Values: Identify and remove any missing or duplicated data to ensure dataset quality.
2. Feature Extraction¶
- Extract key features from the URLs, such as:
- Length of the URL.
- Number of special characters.
- Presence of suspicious keywords or domains.
- Tokenizing the URL and creating n-grams.
3. Outlier Handling¶
- Identify and remove extreme outliers using statistical methods (Z-scores or IQR).
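As a minimal sketch of the IQR rule (standard library only; the URL-length values below are made up):

```python
from statistics import quantiles

def iqr_filter(values, k=1.5):
    """Keep values inside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

url_lengths = [20, 22, 25, 27, 30, 31, 33, 35, 38, 400]  # 400 is an extreme outlier
print(iqr_filter(url_lengths))  # the 400-length value is dropped
```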
4. Train-Test Split¶
- Split the dataset into training and testing sets (80/20 split).
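A minimal sketch of the split step with scikit-learn; synthetic arrays stand in for the real feature matrix, and `stratify` keeps the class proportions equal across the two sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 samples, 2 features, balanced binary labels
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 25 + [1] * 25)

# 80/20 split, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2024)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```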
5. Exploratory Data Analysis (EDA)¶
- Target Distribution: Visualize the distribution of the target class (malicious vs non-malicious).
- Feature Distribution: Analyze the distributions of key features in the training data.
6. Feature Engineering¶
- WordCloud for Feature Extraction: Generate WordClouds to extract new features from URLs based on frequently occurring words/phrases.
- EDA for Features: Perform detailed analysis on the extracted features.
7. Feature Selection¶
- Statistical Tests:
- Use f_classif to assess the importance of numerical features.
- Apply Chi-Square tests for binary categorical features.
- Feature Selection Techniques:
- Use Mutual Information for selecting informative features.
- Apply BorutaPy for selecting features based on their relevance.
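A small sketch of univariate selection with scikit-learn's SelectKBest and f_classif, on synthetic data where only the first feature is actually informative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)   # binary labels
X = rng.normal(size=(200, 4))      # 4 numeric features
X[:, 0] += 3 * y                   # feature 0 shifts strongly with the label

# Keep the 2 features with the highest ANOVA F-statistic
selector = SelectKBest(f_classif, k=2).fit(X, y)
print(selector.get_support())      # boolean mask of the kept features
```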
8. Model Building¶
- Train 3 Types of Models:
- XGBoost (XGB), LightGBM (LGBM), and CatBoost.
- For each model, use 3 different sets of features:
- All Features.
- Mutual Information (MI) Selected Features.
- BorutaPy Selected Features.
- Optimization: Use Optuna for hyperparameter optimization across all models, resulting in a total of 9 model configurations.
9. Model Evaluation¶
- Compare the performance of all 9 models using metrics such as accuracy, balanced accuracy, precision, recall and F1 score.
- Identify the best-performing models based on evaluation metrics.
10. Model Interpretability¶
- Use SHAP (SHapley Additive exPlanations) to interpret the feature importance and explain the predictions of the best-performing models.
11. Deep Learning Models¶
- Artificial Neural Networks (ANN):
- Perform manual hyperparameter tuning.
- Train a feed-forward neural network (FNN) for URL classification.
- Model evaluation.
- BERT:
- Use BERT (Bidirectional Encoder Representations from Transformers) to classify URLs based on text embeddings.
- Train FNN model.
- Model evaluation.
Dataset Description¶
In this case study, we will be using a Malicious URLs dataset consisting of 651,191 URLs, categorized as follows:
- 428,103 Benign or Safe URLs
- 96,457 Defacement URLs
- 94,111 Phishing URLs
- 32,520 Malware URLs
Now, let’s discuss the different types of URLs in our dataset: Benign, Malware, Phishing, and Defacement URLs.
Benign URLs¶
These are safe to browse URLs. Some examples of benign URLs include:
- mp3raid.com/music/krizz_kaliko.html
- infinitysw.com
- google.co.in
- myspace.com
Malware URLs¶
These types of URLs inject malware into the victim’s system once they visit such URLs. Some examples of malware URLs include:
- proplast.co.nz
- http://103.112.226.142:36308/Mozi.m
- microencapsulation.readmyweather.com
- xo3fhvm5lcvzy92q.download
Defacement URLs¶
Defacement URLs are typically created by hackers with the intention of breaking into a web server and replacing the hosted website with one of their own, using techniques such as code injection or cross-site scripting. Common targets of defacement URLs include religious, government, bank, and corporate websites. Some examples of defacement URLs include:
- http://www.vnic.co/khach-hang.html
- http://www.raci.it/component/user/reset.html
- http://www.approvi.com.br/ck.htm
- http://www.juventudelirica.com.br/index.html
Phishing URLs¶
Phishing URLs are created by hackers to steal sensitive personal or financial information such as login credentials, credit card numbers, and internet banking details. Some examples of phishing URLs are:
- roverslands.net
- corporacionrossenditotours.com
- http://drive-google-com.fanalav.com/6a7ec96d6a
- citiprepaid-salarysea-at.tk
Importing Libraries¶
import pandas as pd
import numpy as np
import scipy.stats as stats
import math
import time
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.ticker as ticker
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
%matplotlib inline
from typing import Optional, Callable, Union, Any, Tuple, List
import re
from urllib.parse import urlparse
import tldextract
from sklearn.utils import shuffle, compute_sample_weight
from sklearn.feature_selection import mutual_info_classif, f_classif, SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import xgboost as xgb
import lightgbm as lgb
import catboost as cat
from sklearn.ensemble import RandomForestClassifier
import optuna
import time
from boruta import BorutaPy
from wordcloud import WordCloud
import shap
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', message="No further splits with positive gain")
Functions¶
def bar_plot(data: pd.DataFrame, x: str, y: str, hue: Optional[str] = None,
             title: Optional[str] = None, xlabel: Optional[str] = None, ylabel: Optional[str] = None) -> None:
plt.figure(figsize=(10, 6))
plt.title(title)
sns.barplot(x=x, y=y, hue=hue, data=data,legend=True)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
def plot_corr_matrix(df) -> None:
'''Plots formatted correlation matrix for the supplied df.'''
fig = plt.figure()
fig.set_size_inches(10,8)
ax = fig.add_subplot()
corr_mat = df.corr()
mask = np.triu(corr_mat)
cmap = sns.diverging_palette(220, 20, as_cmap=True)
sns.heatmap(corr_mat, square=True, mask=mask, cmap=cmap,
annot=True, fmt='.2f', vmin=-1, vmax=1,
annot_kws={'fontsize':'small'},
ax=ax);
ax.set_title('Pearson correlation heatmap');
def count_https(url: str) -> int:
    return url.count('https')
def count_http(url: str) -> int:
    # Note: also matches the 'http' prefix of every 'https' occurrence
    return url.count('http')
def having_ip_address(url: str) -> int:
pattern = (
r'(([01]?\d\d?|2[0-4]\d|25[0-5])\.)' # First part of IPv4
r'([01]?\d\d?|2[0-4]\d|25[0-5])\.' # Second part of IPv4
r'([01]?\d\d?|2[0-4]\d|25[0-5])\.' # Third part of IPv4
r'([01]?\d\d?|2[0-4]\d|25[0-5])|' # Fourth part of IPv4
r'((?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4})' # IPv6
)
match = re.search(pattern, url)
if match:
return 1
else:
return 0
def abnormal_url(url: str) -> int:
    # Flags URLs that parse with an explicit hostname appearing in the raw
    # string; scheme-less URLs (no parseable hostname) return 0
    hostname = urlparse(url).hostname
    if hostname and re.search(re.escape(hostname), url):
        return 1
    return 0
def has_subdomain(url: str) -> int:
extracted_info = tldextract.extract(url)
return int( bool(extracted_info.subdomain))
def extract_tld(url):
extracted_info = tldextract.extract(url)
return extracted_info.suffix
def is_risky_tld(url: str) -> int:
tld = extract_tld(url)
risky_tlds = {
'ru', 'cn', 'tk', 'ml', 'ga', 'cf', 'gq', 'work', 'xyz', 'top',
'club', 'men', 'biz', 'info', 'pw', 'cc', 'in', 'us', 'eu', 'co'
}
return tld.lower() in risky_tlds
def is_suspicious_suffix(url: str) -> int:
suspicious_file_extensions = ('.exe', '.php', '.js', '.zip', '.cgi', '.asp', '.aspx')
return int( any(url.lower().endswith(ext) for ext in suspicious_file_extensions) )
def has_shortening_service(url: str) -> int:
pattern = re.compile(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
r'tr\.im|link\.zip\.net', re.IGNORECASE)
match = pattern.search(url)
return int(bool(match))
def contains_suspicious_word(url: str) -> int:
pattern = re.compile(r'PayPal|login|signin|bank|account|update|free|lucky|service|bonus|ebayisapi|webscr', re.IGNORECASE)
match = pattern.search(url)
return 1 if match else 0
def longest_digit_sequence(url: str) -> int:
return max(map(len, re.findall(r'\d+', url)), default=0)
def contains_non_ascii(url: str) -> int:
return int( any(ord(char) > 127 for char in url) )
def has_port_number(url: str) -> int:
parsed_url = urlparse(url)
return int( bool(parsed_url.port) )
def count_alpha(url: str) -> int:
    return sum(1 for char in url if char.isalpha())
def count_digits(url: str) -> int:
    return sum(1 for char in url if char.isnumeric())
def count_hexadecimal_chars(url: str) -> int:
    # Counts percent-encoded sequences such as %20 or %3A
    return len(re.findall(r'%[0-9A-Fa-f]{2}', url))
def count_dot(url: str) -> int:
return url.count('.')
def count_www(url: str) -> int:
return url.count('www')
def count_atrate(url: str) -> int:
return url.count('@')
def count_per(url: str) -> int:
return url.count('%')
def count_ques(url: str) -> int:
return url.count('?')
def count_hyphen(url: str) -> int:
return url.count('-')
def count_equal(url: str) -> int:
return url.count('=')
def count_slashes(url: str) -> int:
return url.count('/')
def count_double_slashes(url: str) -> int:
return url.count('//')
def sum_special_chars(url: str) -> int:
special_chars = "!#$%&()*+,/:;<=>?@[\\]^_`{|}~"
return sum(1 for char in url if char in special_chars)
def count_parameters(url: str) -> int:
parsed_url = urlparse(url)
return len(parsed_url.query.split('&')) if parsed_url.query else 0
def count_repeated_char(url: str) -> int:
    # Highest occurrence count of any single character; 0 for empty strings
    return max((url.count(char) for char in set(url)), default=0)
def count_subdomains(url: str) -> int:
extracted_info = tldextract.extract(url)
subdomain = extracted_info.subdomain
return len(subdomain.split('.')) if subdomain else 0
def number_of_directories(url: str) -> int:
urldir = urlparse(url).path
return urldir.count('/')
def number_of_embedded(url: str) -> int:
urldir = urlparse(url).path
return urldir.count('//')
def get_url_length(url: str) -> int:
return len(url)
def get_domain_length(url: str) -> int:
parsed_url = urlparse(url)
domain = parsed_url.netloc or parsed_url.path.split('/')[0]
return len(domain.split(':')[0])
def get_path_length(url: str) -> int:
    # Length of the first path segment (for scheme-less URLs, urlparse puts
    # the domain in the path, so this is the domain part's length)
    urlpath = urlparse(url).path
    path_segments = urlpath.strip('/').split('/')
    return len(path_segments[0])
def first_directory_length(url: str) -> int:
    urlpath = urlparse(url).path
    try:
        return len(urlpath.split('/')[1])
    except IndexError:
        return 0
def check_accuracy(true_labels, predicted_labels, metric: str) -> None:
conf_matrix = confusion_matrix(true_labels, predicted_labels)
accuracy = accuracy_score(true_labels, predicted_labels)
balanced_accuracy = balanced_accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='macro')
recall = recall_score(true_labels, predicted_labels, average='macro')
f1 = f1_score(true_labels, predicted_labels, average='macro')
metrics_text = (f"Accuracy: {accuracy:.2f}\n"
f"Balanced Accuracy: {balanced_accuracy:.2f}\n"
f"Precision: {precision:.2f}\n"
f"Recall: {recall:.2f}\n"
f"F1-Score: {f1:.2f}")
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=True,
yticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'],
xticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'], ax=ax[0])
ax[0].set_title(f'Confusion Matrix for {metric} Model (Counts)')
ax[0].set_xlabel('Predicted Labels')
ax[0].set_ylabel('True Labels')
conf_matrix_normalized = conf_matrix.astype('float') / conf_matrix.sum(axis=1)[:, np.newaxis]
sns.heatmap(conf_matrix_normalized, annot=True, fmt='.2%', cmap='Blues', cbar=True,
yticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'],
xticklabels=['Benign', 'Phishing', 'Defacement', 'Malware'], ax=ax[1])
ax[1].set_title(f'Confusion Matrix for {metric} Model (Percentage)')
ax[1].set_xlabel('Predicted Labels')
ax[1].set_ylabel('True Labels')
fig.text(0.5, -0.05, metrics_text, ha='center', va='center', fontsize=12)
plt.tight_layout()
plt.show()
Config¶
DIRECTORY_PATH = 'C:\\Users\\גיא\\OneDrive\\שולחן העבודה\\סדנה במדעי הנתונים\\malicious'
FILE_NAME = 'malicious_phish.csv'
TARGET = 'type'
SPLIT = 0.8
RANDOM_STATE = 2024
np.random.seed(RANDOM_STATE)
color_mapping = {
'benign': '#66c2a5',
'phishing': '#fc8d62',
'defacement': '#ffd92f',
'malware': '#b3b3b3',
}
labelsize = 8
fontsize = 10
categorical_columns = [ 'having_ip_address' ,'is_abnormal_url' ,'has_subdomain' ,'is_risky_tld',
'is_suspicious_suffix' ,'has_shortening_service' ,'has_port_number', 'contains_non_ascii', 'contains_suspicious_word']
numerical_columns = ['longest_digit_sequence', 'count_https', 'count_http', 'count_alpha', 'count_digits', 'count_hex_char',
'count_dot', 'count_@', 'count_%', 'count_?', 'count_-', 'count_=', 'count_/',
'count_//', 'sum_special_chars', 'count_parameters', 'count_repeated_char',
'count_subdomain', 'number_of_directories', 'number_of_embedded', 'url_length',
'domain_length', 'path_length', 'tld_length', 'first_directory_length',
'alpha_char_ratio', 'digit_char_ratio', 'special_char_ratio']
Loading Dataset & Data Description¶
raw_data = pd.read_csv(DIRECTORY_PATH+'\\'+FILE_NAME)
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651191 entries, 0 to 651190
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   url     651191 non-null  object
 1   type    651191 non-null  object
dtypes: object(2)
memory usage: 9.9+ MB
raw_data.isna().sum()
url     0
type    0
dtype: int64
total_urls = len(raw_data.url)
unique_urls = raw_data.url.nunique()
plt.figure(figsize=(8, 8))
plt.pie([unique_urls, total_urls - unique_urls], labels=['Unique URLs', 'Duplicate URLs'],
        autopct='%1.1f%%', colors=['#ff9999','#66b3ff'], startangle=140, textprops={'fontsize': 14})
plt.title('Proportion of Unique vs. Duplicate URLs', fontsize=16)
plt.show()
# Remove duplicated URLs, keeping the first occurrence
df = raw_data.drop_duplicates(subset='url').copy()
Feature Engineering: Extracting Features from URLs¶
Feature engineering is a crucial step in building models for URL classification. By extracting relevant features from URLs, we can enhance the model's ability to differentiate between benign and malicious URLs. Various types of features can be extracted, including lexical features, structural features, and semantic features.
Types of Features¶
Lexical Features: These features relate to the composition of the URL string itself. They include character counts, ratios, and the presence of specific characters or patterns.
- count_http: Counts the occurrences of the http protocol in the URL. A higher count may indicate a preference for non-secure connections.
- count_https: Counts the occurrences of the https protocol in the URL. More secure URLs may have a higher count, but attackers can also use HTTPS to disguise malicious intent.
- count_alpha: Total number of alphabetic characters in the URL. Higher counts might indicate more complex URLs, which are often used in phishing.
- count_digits: Total number of numeric characters in the URL. Attackers may use numbers to obscure the URL's intent.
- count_hex_char: Count of percent-encoded (hexadecimal) characters present in the URL. This can indicate obfuscation techniques.
- count_dot: Number of dots (.) in the URL. Attackers may use multiple dots to create subdomains or mislead users.
- count_@: Count of the @ symbol in the URL, often used in email addresses or to obscure the true domain.
- count_%: Number of percent signs (%) in the URL, commonly used in encoding, which can signal attempts to disguise content.
- count_?: Count of question marks (?), indicating query parameters that can carry additional malicious payloads.
- count_-: Count of hyphens (-), often used in deceptive URL formations.
- count_=: Number of equal signs (=) in the URL, often seen in query strings.
- count_/: Count of slashes (/), which can indicate nested paths or complexity in URL structure.
- count_//: Count of double slashes (//), often indicating the start of a resource in the URL.
- sum_special_chars: Total count of all special characters in the URL. A higher number can indicate an attempt to obfuscate the URL.
- count_parameters: Number of parameters in the URL. URLs with many parameters may be more likely to contain malicious content.
- count_repeated_char: Highest number of occurrences of any single character in the URL, which can be a tactic used in phishing attempts.
- count_subdomain: Count of subdomains in the URL. Malicious URLs often use multiple subdomains to confuse users.
Structural Features: These features describe the structure and components of the URL.
- having_ip_address: Checks whether the URL contains an IP address instead of a domain name. Cyber attackers often use IP addresses to hide their identity.
- is_abnormal_url: Identifies URLs that exhibit characteristics that deviate from typical patterns, potentially indicating malicious intent.
- has_subdomain: Indicates the presence of a subdomain in the URL. Malicious URLs often utilize subdomains to mislead users.
- is_risky_tld: Flags if the URL has a top-level domain associated with higher risks, such as .xyz or .info.
- is_suspicious_suffix: Flags URLs with suffixes that are commonly used in phishing or malicious URLs.
- has_shortening_service: Indicates if the URL uses a URL shortening service (bit.ly), which can obscure the true destination.
- longest_digit_sequence: The length of the longest sequence of digits found in the URL. Longer sequences may indicate attempts to obfuscate.
- contains_non_ascii: Flags if the URL contains non-ASCII characters, which can be used in obfuscation tactics.
- has_port_number: Indicates if the URL specifies a port number, which may be uncommon for benign URLs.
Semantic Features: These features capture the meaning behind certain components of the URL.
- url_length: Total length of the URL. Attackers often use longer URLs to hide the domain name and mislead users.
- domain_length: Length of the domain part of the URL. Short or overly complex domains can signal potential threats.
- path_length: Length of the path in the URL. Longer paths can indicate attempts to confuse users.
- tld_length: Length of the top-level domain. Longer TLDs may be used to disguise malicious intent.
- first_directory_length: Length of the first directory in the URL path. Short or suspicious first directories can indicate potential threats.
- contains_suspicious_word: Flags the presence of suspicious words in the URL, which are often associated with phishing.
- alpha_char_ratio: Ratio of alphabetic characters to total characters, providing insight into the URL's composition.
- digit_char_ratio: Ratio of numeric characters to total characters, highlighting potential obfuscation.
- special_char_ratio: Ratio of special characters to total characters, indicating complexity and potential risk.
df['having_ip_address'] = df['url'].apply(having_ip_address)
df['is_abnormal_url'] = df['url'].apply(abnormal_url)
df['has_subdomain'] = df['url'].apply(has_subdomain)
df['is_risky_tld'] = df['url'].apply(is_risky_tld)
df['is_suspicious_suffix'] = df['url'].apply(is_suspicious_suffix)
df['has_shortening_service'] = df['url'].apply(has_shortening_service)
df['longest_digit_sequence'] = df['url'].apply(longest_digit_sequence)
df['contains_non_ascii'] = df['url'].apply(contains_non_ascii)
df['has_port_number'] = df['url'].apply(has_port_number)
df['count_http'] = df['url'].apply(count_http)
df['count_https'] = df['url'].apply(count_https)
df['count_alpha'] = df['url'].apply(count_alpha)
df['count_digits'] = df['url'].apply(count_digits)
df['count_hex_char'] = df['url'].apply(count_hexadecimal_chars)
df['count_dot'] = df['url'].apply(count_dot)
df['count_@'] = df['url'].apply(count_atrate)
df['count_%'] = df['url'].apply(count_per)
df['count_?'] = df['url'].apply(count_ques)
df['count_-'] = df['url'].apply(count_hyphen)
df['count_='] = df['url'].apply(count_equal)
df['count_/'] = df['url'].apply(count_slashes)
df['count_//'] = df['url'].apply(count_double_slashes)
df['sum_special_chars'] = df['url'].apply(sum_special_chars)
df['count_parameters'] = df['url'].apply(count_parameters)
df['count_repeated_char'] = df['url'].apply(count_repeated_char)
df['count_subdomain'] = df['url'].apply(count_subdomains)
df['number_of_directories'] = df['url'].apply(number_of_directories)
df['number_of_embedded'] = df['url'].apply(number_of_embedded)
df['url_length'] = df['url'].apply(get_url_length)
df['domain_length'] = df['url'].apply(get_domain_length)
df['path_length'] = df['url'].apply(get_path_length)
df['tld_length'] = df['url'].apply(lambda x: len(extract_tld(x)))
df['first_directory_length'] = df['url'].apply(first_directory_length)
df['contains_suspicious_word'] = df['url'].apply(contains_suspicious_word)
df['alpha_char_ratio'] = df["count_alpha"] / df["url_length"]
df['digit_char_ratio'] = df["count_digits"] / df["url_length"]
df['special_char_ratio'] = df["sum_special_chars"] / df["url_length"]
df.set_index('url', inplace=True)
df.shape
(641119, 38)
df.tail(10)
| type | having_ip_address | is_abnormal_url | has_subdomain | is_risky_tld | is_suspicious_suffix | has_shortening_service | longest_digit_sequence | contains_non_ascii | has_port_number | ... | number_of_embedded | url_length | domain_length | path_length | tld_length | first_directory_length | contains_suspicious_word | alpha_char_ratio | digit_char_ratio | special_char_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| url | |||||||||||||||||||||
| www.1up.com/do/gameOverview?cId=3159391 | phishing | 0 | 0 | 1 | False | 0 | 0 | 7 | 0 | 0 | ... | 0 | 39 | 11 | 11 | 3 | 2 | 0 | 0.641026 | 0.205128 | 0.102564 |
| psx.ign.com/articles/131/131835p1.html | phishing | 0 | 0 | 1 | False | 0 | 0 | 6 | 0 | 0 | ... | 0 | 38 | 11 | 11 | 3 | 8 | 0 | 0.578947 | 0.263158 | 0.078947 |
| wii.gamespy.com/wii/cursed-mountain/ | phishing | 0 | 0 | 1 | False | 0 | 0 | 0 | 0 | 0 | ... | 0 | 36 | 15 | 15 | 3 | 3 | 0 | 0.833333 | 0.000000 | 0.083333 |
| wii.ign.com/objects/142/14270799.html | phishing | 0 | 0 | 1 | False | 0 | 0 | 8 | 0 | 0 | ... | 0 | 37 | 11 | 11 | 3 | 7 | 0 | 0.540541 | 0.297297 | 0.081081 |
| xbox360.gamespy.com/xbox-360/dead-space/ | phishing | 0 | 0 | 1 | False | 0 | 0 | 3 | 0 | 0 | ... | 0 | 40 | 19 | 19 | 3 | 8 | 0 | 0.675000 | 0.150000 | 0.075000 |
| xbox360.ign.com/objects/850/850402.html | phishing | 0 | 0 | 1 | False | 0 | 0 | 6 | 0 | 0 | ... | 0 | 39 | 15 | 15 | 3 | 7 | 0 | 0.538462 | 0.307692 | 0.076923 |
| games.teamxbox.com/xbox-360/1860/Dead-Space/ | phishing | 0 | 0 | 1 | False | 0 | 1 | 4 | 0 | 0 | ... | 0 | 44 | 18 | 18 | 3 | 8 | 0 | 0.659091 | 0.159091 | 0.090909 |
| www.gamespot.com/xbox360/action/deadspace/ | phishing | 0 | 0 | 1 | False | 0 | 1 | 3 | 0 | 0 | ... | 0 | 42 | 16 | 16 | 3 | 7 | 0 | 0.785714 | 0.071429 | 0.095238 |
| en.wikipedia.org/wiki/Dead_Space_(video_game) | phishing | 0 | 0 | 1 | False | 0 | 0 | 0 | 0 | 0 | ... | 0 | 45 | 16 | 16 | 3 | 4 | 0 | 0.800000 | 0.000000 | 0.155556 |
| www.angelfire.com/goth/devilmaycrytonite/ | phishing | 0 | 0 | 1 | False | 0 | 0 | 0 | 0 | 0 | ... | 0 | 41 | 17 | 17 | 3 | 4 | 0 | 0.878049 | 0.000000 | 0.073171 |
10 rows × 38 columns
df[df.select_dtypes(include=[bool]).columns] = df.select_dtypes(include=[bool]).astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 641119 entries, br-icloud.com.br to www.angelfire.com/goth/devilmaycrytonite/
Data columns (total 38 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   type                      641119 non-null  object
 1   having_ip_address         641119 non-null  int64
 2   is_abnormal_url           641119 non-null  int64
 3   has_subdomain             641119 non-null  int64
 4   is_risky_tld              641119 non-null  int32
 5   is_suspicious_suffix      641119 non-null  int64
 6   has_shortening_service    641119 non-null  int64
 7   longest_digit_sequence    641119 non-null  int64
 8   contains_non_ascii        641119 non-null  int64
 9   has_port_number           641119 non-null  int64
 10  count_http                641119 non-null  int64
 11  count_https               641119 non-null  int64
 12  count_alpha               641119 non-null  int64
 13  count_digits              641119 non-null  int64
 14  count_hex_char            641119 non-null  int64
 15  count_dot                 641119 non-null  int64
 16  count_@                   641119 non-null  int64
 17  count_%                   641119 non-null  int64
 18  count_?                   641119 non-null  int64
 19  count_-                   641119 non-null  int64
 20  count_=                   641119 non-null  int64
 21  count_/                   641119 non-null  int64
 22  count_//                  641119 non-null  int64
 23  sum_special_chars         641119 non-null  int64
 24  count_parameters          641119 non-null  int64
 25  count_repeated_char       641119 non-null  int64
 26  count_subdomain           641119 non-null  int64
 27  number_of_directories     641119 non-null  int64
 28  number_of_embedded        641119 non-null  int64
 29  url_length                641119 non-null  int64
 30  domain_length             641119 non-null  int64
 31  path_length               641119 non-null  int64
 32  tld_length                641119 non-null  int64
 33  first_directory_length    641119 non-null  int64
 34  contains_suspicious_word  641119 non-null  int64
 35  alpha_char_ratio          641119 non-null  float64
 36  digit_char_ratio          641119 non-null  float64
 37  special_char_ratio        641119 non-null  float64
dtypes: float64(3), int32(1), int64(33), object(1)
memory usage: 188.3+ MB
df.select_dtypes(include=['O']).columns
Index(['type'], dtype='object')
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| having_ip_address | 641119.0 | 0.019461 | 0.138140 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| is_abnormal_url | 641119.0 | 0.277557 | 0.447794 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| has_subdomain | 641119.0 | 0.390959 | 0.487966 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| is_risky_tld | 641119.0 | 0.032602 | 0.177594 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| is_suspicious_suffix | 641119.0 | 0.045665 | 0.208759 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| has_shortening_service | 641119.0 | 0.061483 | 0.240215 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| longest_digit_sequence | 641119.0 | 2.452490 | 3.280998 | 0.0 | 0.000000 | 1.000000 | 4.000000 | 133.000000 |
| contains_non_ascii | 641119.0 | 0.001427 | 0.037751 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| has_port_number | 641119.0 | 0.007722 | 0.087537 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| count_http | 641119.0 | 0.285666 | 0.466025 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 9.000000 |
| count_https | 641119.0 | 0.025992 | 0.162631 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 5.000000 |
| count_alpha | 641119.0 | 45.179165 | 31.735030 | 0.0 | 25.000000 | 37.000000 | 58.000000 | 2141.000000 |
| count_digits | 641119.0 | 5.371986 | 11.630365 | 0.0 | 0.000000 | 2.000000 | 6.000000 | 1204.000000 |
| count_hex_char | 641119.0 | 0.397326 | 4.165907 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 231.000000 |
| count_dot | 641119.0 | 2.193950 | 1.491449 | 0.0 | 1.000000 | 2.000000 | 3.000000 | 42.000000 |
| count_@ | 641119.0 | 0.002243 | 0.054507 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| count_% | 641119.0 | 0.398489 | 4.166377 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 231.000000 |
| count_? | 641119.0 | 0.221391 | 0.440003 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 20.000000 |
| count_- | 641119.0 | 1.561364 | 2.984744 | 0.0 | 0.000000 | 0.000000 | 2.000000 | 87.000000 |
| count_= | 641119.0 | 0.591642 | 1.491306 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 51.000000 |
| count_/ | 641119.0 | 2.921902 | 1.895781 | 0.0 | 2.000000 | 3.000000 | 4.000000 | 41.000000 |
| count_// | 641119.0 | 0.281310 | 0.456609 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 9.000000 |
| sum_special_chars | 641119.0 | 5.436424 | 6.528653 | 0.0 | 2.000000 | 4.000000 | 6.000000 | 367.000000 |
| count_parameters | 641119.0 | 0.578999 | 1.471207 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 51.000000 |
| count_repeated_char | 641119.0 | 6.565156 | 5.371619 | 1.0 | 4.000000 | 5.000000 | 8.000000 | 588.000000 |
| count_subdomain | 641119.0 | 0.496700 | 1.007248 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 33.000000 |
| number_of_directories | 641119.0 | 2.310321 | 1.566776 | 0.0 | 1.000000 | 2.000000 | 3.000000 | 39.000000 |
| number_of_embedded | 641119.0 | 0.001529 | 0.039543 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| url_length | 641119.0 | 59.762470 | 44.894590 | 1.0 | 32.000000 | 47.000000 | 76.000000 | 2175.000000 |
| domain_length | 641119.0 | 17.403839 | 11.360269 | 0.0 | 12.000000 | 16.000000 | 20.000000 | 248.000000 |
| path_length | 641119.0 | 15.336243 | 12.896271 | 0.0 | 9.000000 | 13.000000 | 18.000000 | 304.000000 |
| tld_length | 641119.0 | 2.986825 | 0.904940 | 0.0 | 3.000000 | 3.000000 | 3.000000 | 18.000000 |
| first_directory_length | 641119.0 | 8.527999 | 11.064798 | 0.0 | 4.000000 | 6.000000 | 9.000000 | 408.000000 |
| contains_suspicious_word | 641119.0 | 0.076884 | 0.266408 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| alpha_char_ratio | 641119.0 | 0.777544 | 0.116276 | 0.0 | 0.735294 | 0.800000 | 0.857143 | 1.000000 |
| digit_char_ratio | 641119.0 | 0.070980 | 0.100210 | 0.0 | 0.000000 | 0.031250 | 0.104651 | 1.000000 |
| special_char_ratio | 641119.0 | 0.084431 | 0.045670 | 0.0 | 0.052632 | 0.078947 | 0.111111 | 0.535714 |
Outlier Detection:¶
We observe some extremely out-of-range values as outliers, such as in the features url_length, path_length, domain_length, count_repeated_char, and first_directory_length.
For example, the url_length feature exhibited significant extreme values, as indicated by the descriptive statistics:
- 25th Percentile: 32
- 75th Percentile: 76
- Maximum Length: 2175
The maximum value of 2175 is notably high compared to the interquartile range, suggesting the presence of extreme outliers. By applying a filter to remove values exceeding the 95th percentile for url_length, we aim to mitigate the influence of these extreme cases on our analysis and model performance. This step is crucial for enhancing the robustness of our results while ensuring that the majority of meaningful observations are retained.
Moreover, removing outliers based on url_length may also help address the outliers in the remaining features, since those features are likely correlated and influenced by similar patterns in the data.
filtered_df = df[ df['url_length'] < df['url_length'].quantile(.95)]  # keep the non-extreme 95%
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 7))
df_labels = df['type'].value_counts()
filtered_labels = filtered_df['type'].value_counts()
ax[0].pie(df_labels, labels=df_labels.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[color_mapping[label] for label in df_labels.index])
ax[0].set_title('Original Distribution', fontsize=fontsize, y=-0.1)
ax[1].pie(filtered_labels, labels=filtered_labels.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[color_mapping[label] for label in filtered_labels.index])
ax[1].set_title('Distribution After Outlier Removal', fontsize=fontsize, y=-0.1)
fig.suptitle(f'\n URL Types Distribution', fontsize=18, y=1.02)
plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.show()
Outlier Handling:¶
During the analysis, it was observed that the url_length feature contained extreme outliers that could distort the results. To address this, a threshold was applied to remove the outliers by filtering out URLs with a length greater than the 95th percentile. This step ensures cleaner data without distorting the target distribution, as the distribution of the target variable remained consistent before and after the removal of outliers. Thus, removing these outliers does not negatively affect the overall target distribution or the model's ability to generalize.
df = df[ df['url_length'] < df['url_length'].quantile(.95)]
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| having_ip_address | 608510.0 | 0.020292 | 0.140998 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| is_abnormal_url | 608510.0 | 0.258231 | 0.437662 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| has_subdomain | 608510.0 | 0.382896 | 0.486094 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| is_risky_tld | 608510.0 | 0.032014 | 0.176038 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| is_suspicious_suffix | 608510.0 | 0.046717 | 0.211033 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| has_shortening_service | 608510.0 | 0.061628 | 0.240478 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| longest_digit_sequence | 608510.0 | 2.304192 | 3.078039 | 0.0 | 0.000000 | 1.000000 | 4.000000 | 81.000000 |
| contains_non_ascii | 608510.0 | 0.001173 | 0.034234 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| has_port_number | 608510.0 | 0.008113 | 0.089708 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| count_http | 608510.0 | 0.263940 | 0.449999 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 4.000000 |
| count_https | 608510.0 | 0.023904 | 0.154548 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| count_alpha | 608510.0 | 40.403134 | 20.975018 | 0.0 | 25.000000 | 35.000000 | 53.000000 | 125.000000 |
| count_digits | 608510.0 | 3.870893 | 5.847641 | 0.0 | 0.000000 | 1.000000 | 6.000000 | 89.000000 |
| count_hex_char | 608510.0 | 0.141838 | 1.453663 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 37.000000 |
| count_dot | 608510.0 | 2.085859 | 1.161088 | 0.0 | 1.000000 | 2.000000 | 3.000000 | 28.000000 |
| count_@ | 608510.0 | 0.001755 | 0.044076 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
| count_% | 608510.0 | 0.142676 | 1.454720 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 37.000000 |
| count_? | 608510.0 | 0.194409 | 0.418829 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 20.000000 |
| count_- | 608510.0 | 1.419641 | 2.711581 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 41.000000 |
| count_= | 608510.0 | 0.454431 | 1.195632 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 17.000000 |
| count_/ | 608510.0 | 2.845536 | 1.832456 | 0.0 | 1.000000 | 3.000000 | 4.000000 | 28.000000 |
| count_// | 608510.0 | 0.261761 | 0.445712 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| sum_special_chars | 608510.0 | 4.699941 | 4.180913 | 0.0 | 2.000000 | 4.000000 | 6.000000 | 42.000000 |
| count_parameters | 608510.0 | 0.450476 | 1.184976 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 17.000000 |
| count_repeated_char | 608510.0 | 5.895251 | 2.971220 | 1.0 | 4.000000 | 5.000000 | 7.000000 | 70.000000 |
| count_subdomain | 608510.0 | 0.444137 | 0.709705 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 26.000000 |
| number_of_directories | 608510.0 | 2.276585 | 1.508885 | 0.0 | 1.000000 | 2.000000 | 3.000000 | 28.000000 |
| number_of_embedded | 608510.0 | 0.001548 | 0.039772 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| url_length | 608510.0 | 52.489627 | 27.845507 | 1.0 | 31.000000 | 45.000000 | 70.000000 | 133.000000 |
| domain_length | 608510.0 | 16.689652 | 7.098847 | 0.0 | 12.000000 | 16.000000 | 20.000000 | 132.000000 |
| path_length | 608510.0 | 14.731087 | 8.562510 | 0.0 | 9.000000 | 14.000000 | 18.000000 | 132.000000 |
| tld_length | 608510.0 | 2.980912 | 0.892058 | 0.0 | 3.000000 | 3.000000 | 3.000000 | 18.000000 |
| first_directory_length | 608510.0 | 8.239087 | 9.639157 | 0.0 | 4.000000 | 6.000000 | 9.000000 | 125.000000 |
| contains_suspicious_word | 608510.0 | 0.064053 | 0.244848 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| alpha_char_ratio | 608510.0 | 0.781951 | 0.113515 | 0.0 | 0.739130 | 0.805556 | 0.859155 | 1.000000 |
| digit_char_ratio | 608510.0 | 0.066333 | 0.096428 | 0.0 | 0.000000 | 0.025000 | 0.098039 | 1.000000 |
| special_char_ratio | 608510.0 | 0.083615 | 0.043625 | 0.0 | 0.052632 | 0.078947 | 0.111111 | 0.535714 |
Post-Outlier Removal Analysis:¶
After removing URLs with a url_length greater than the 95th percentile, the descriptive statistics show a much more reasonable range. This adjustment has minimized extreme values, resulting in a dataset that better reflects typical URL characteristics and enhances the reliability of our analysis and model performance.
original_count = 641119.0
new_count = 608510.0
percentage_loss = np.round((original_count - new_count) / original_count * 100, 2)
print(f"By removing the outliers, we lost {percentage_loss}% of the data.")
By removing the outliers, we lost 5.09% of the data.
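Rather than hard-coding the before/after row counts, the same loss figure can be derived from the DataFrame itself by capturing the row count before filtering. A small sketch with a toy frame standing in for the real data:

```python
import pandas as pd

# Sketch: compute the data loss without hard-coding counts by capturing
# the row count before filtering. Toy data stands in for the real frame;
# the last row is an extreme url_length outlier.
df = pd.DataFrame({'url_length': [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]})
original_count = len(df)
df = df[df['url_length'] < df['url_length'].quantile(.95)]
percentage_loss = round((original_count - len(df)) / original_count * 100, 2)
print(f"By removing the outliers, we lost {percentage_loss}% of the data.")
```

This keeps the printed figure consistent even if the filtering threshold changes later.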
Data Splitting: Train and Test¶
In this project, we are working on a machine learning model. The data is split into two sets:
Training Set (80%): This set will be used to train the model. The model learns from this data by identifying patterns and relationships between the input features and the target variable.
Test Set (20%): This set is reserved to evaluate the performance of the trained model. By testing on unseen data, we can check the model's generalization ability, ensuring that it works well not only on the training data but also on new, unseen data.
The 80/20 split is a common practice, balancing the amount of data available for training while keeping enough data to validate the model's performance effectively.
Model Performance Assessment¶
After training the model on the training set, we will assess its performance on the test set using relevant evaluation metrics. This will help us determine how well the model can generalize to new, unseen data and avoid overfitting to the training data.
split = int(len(df) * SPLIT) # SPLIT = 0.8
df = shuffle(df, random_state=RANDOM_STATE)
train, test = df.iloc[:split], df.iloc[split:]
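As an alternative to the shuffle-and-slice above, scikit-learn's `train_test_split` with the `stratify` argument preserves the class proportions exactly in both sets (a sketch on toy data; the notebook's own split already keeps them approximately equal because the data is shuffled first):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df; 'type' is the target column.
df = pd.DataFrame({'x': range(100),
                   'type': ['benign'] * 80 + ['phishing'] * 20})

# stratify=df['type'] guarantees the class ratio is identical in both splits.
train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df['type'])
print(train['type'].value_counts(normalize=True))
```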
Exploratory Data Analysis (EDA)¶
Target (type) Distribution¶
unique_labels = sorted(set(train['type'].unique()).union(set(test['type'].unique())))
palette_dict = {label: color_mapping[label] for label in unique_labels}
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 14))
train_labels = train['type'].value_counts()
ax[0, 0].pie(train_labels, labels=train_labels.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in train_labels.index])
ax[0, 0].set_title(f'Train Distribution', fontsize=fontsize, y=-0.1)
test_labels = test['type'].value_counts()
ax[0, 1].pie(test_labels, labels=test_labels.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in test_labels.index])
ax[0, 1].set_title(f'Test Distribution', fontsize=fontsize, y=-0.1)
sns.countplot(x='type', data=train, ax=ax[1, 0], order=unique_labels,
palette=[palette_dict[label] for label in unique_labels])
ax[1, 0].set_title('Train Countplot', fontsize=fontsize)
sns.countplot(x='type', data=test, ax=ax[1, 1], order=unique_labels,
palette=[palette_dict[label] for label in unique_labels])
ax[1, 1].set_title('Test Countplot', fontsize=fontsize)
fig.suptitle(f'\n URL Types Distribution', fontsize=18, y=1.02)
plt.tight_layout()
plt.subplots_adjust(top=0.92)
plt.show()
Generate the WordCloud on Training Data¶
benign_url = ' '.join(train[train['type'] == 'benign'].index)
phishing_url = ' '.join(train[train['type'] == 'phishing'].index)
defacement_url = ' '.join(train[train['type'] == 'defacement'].index)
malware_url = ' '.join(train[train['type'] == 'malware'].index)
wordcloud_benign = WordCloud(width=800, height=400, background_color='white').generate(benign_url)
wordcloud_phishing = WordCloud(width=800, height=400, background_color='white').generate(phishing_url)
wordcloud_defacement = WordCloud(width=800, height=400, background_color='white').generate(defacement_url)
wordcloud_malware = WordCloud(width=800, height=400, background_color='white').generate(malware_url)
fig, axs = plt.subplots(2, 2, figsize=(10, 6))
axs[0, 0].imshow(wordcloud_benign, interpolation='bilinear')
axs[0, 0].set_title('Benign URLs')
axs[0, 0].axis('off')
axs[0, 1].imshow(wordcloud_phishing, interpolation='bilinear')
axs[0, 1].set_title('Phishing URLs')
axs[0, 1].axis('off')
axs[1, 0].imshow(wordcloud_defacement, interpolation='bilinear')
axs[1, 0].set_title('Defacement URLs')
axs[1, 0].axis('off')
axs[1, 1].imshow(wordcloud_malware, interpolation='bilinear')
axs[1, 1].set_title('Malware URLs')
axs[1, 1].axis('off')
plt.tight_layout()
plt.show()
Create Features Based on the WordCloud Insights¶
train = train.reset_index()
test = test.reset_index()
# For Malware URLs
train['exe_in_url'] = train['url'].apply(lambda x: 1 if 'exe' in x.lower() else 0)
train['mozi_in_url'] = train['url'].apply(lambda x: 1 if 'mozi' in x.lower() else 0)
train['jp_in_url'] = train['url'].apply(lambda x: 1 if 'jp' in x.lower() else 0)
train['mitsui_in_url'] = train['url'].apply(lambda x: 1 if 'mitsui' in x.lower() else 0)
train['mixh_in_url'] = train['url'].apply(lambda x: 1 if 'mixh' in x.lower() else 0)
test['exe_in_url'] = test['url'].apply(lambda x: 1 if 'exe' in x.lower() else 0)
test['mozi_in_url'] = test['url'].apply(lambda x: 1 if 'mozi' in x.lower() else 0)
test['jp_in_url'] = test['url'].apply(lambda x: 1 if 'jp' in x.lower() else 0)
test['mitsui_in_url'] = test['url'].apply(lambda x: 1 if 'mitsui' in x.lower() else 0)
test['mixh_in_url'] = test['url'].apply(lambda x: 1 if 'mixh' in x.lower() else 0)
# For Phishing URLs
train['ietf_in_url'] = train['url'].apply(lambda x: 1 if 'ietf' in x.lower() else 0)
train['tools_in_url'] = train['url'].apply(lambda x: 1 if 'tools' in x.lower() else 0)
test['ietf_in_url'] = test['url'].apply(lambda x: 1 if 'ietf' in x.lower() else 0)
test['tools_in_url'] = test['url'].apply(lambda x: 1 if 'tools' in x.lower() else 0)
# For Defacement URLs
train['index_in_url'] = train['url'].apply(lambda x: 1 if 'index' in x.lower() else 0)
train['com_content_in_url'] = train['url'].apply(lambda x: 1 if 'com_content' in x.lower() else 0)
train['option_in_url'] = train['url'].apply(lambda x: 1 if 'option' in x.lower() else 0)
train['php_in_url'] = train['url'].apply(lambda x: 1 if 'php' in x.lower() else 0)
test['index_in_url'] = test['url'].apply(lambda x: 1 if 'index' in x.lower() else 0)
test['com_content_in_url'] = test['url'].apply(lambda x: 1 if 'com_content' in x.lower() else 0)
test['option_in_url'] = test['url'].apply(lambda x: 1 if 'option' in x.lower() else 0)
test['php_in_url'] = test['url'].apply(lambda x: 1 if 'php' in x.lower() else 0)
train.set_index('url', inplace=True)
test.set_index('url', inplace=True)
print( train.shape, test.shape )
(486808, 49) (121702, 49)
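The block of near-identical `.apply` calls above can be condensed into a single loop over the keyword list. A sketch of the same logic (same feature names, vectorised with `str.contains`; the toy frame stands in for `train`/`test`):

```python
import pandas as pd

# Sketch: generate the word-cloud indicator columns in one loop instead
# of one .apply call per keyword. Toy frame stands in for train/test.
words = ['exe', 'mozi', 'jp', 'mitsui', 'mixh',
         'ietf', 'tools', 'index', 'com_content', 'option', 'php']

train = pd.DataFrame({'url': ['http://evil.com/setup.exe',
                              'http://site.jp/index.php']})
for w in words:
    # case-insensitive literal substring match (regex disabled)
    train[f'{w}_in_url'] = (train['url'].str.lower()
                                        .str.contains(w, regex=False)
                                        .astype(int))
print(train[['exe_in_url', 'php_in_url']])
```

Applying the same loop to both `train` and `test` also guarantees the two frames get identical feature columns.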
Word Cloud Features Analysis¶
unique_labels = sorted(set(train['type'].unique()).union(set(test['type'].unique())))
palette_dict = {label: color_mapping[label] for label in unique_labels}
word_cloud_columns = ['exe_in_url', 'mozi_in_url', 'jp_in_url', 'mitsui_in_url', 'mixh_in_url' ,
'ietf_in_url', 'tools_in_url',
'index_in_url', 'com_content_in_url', 'option_in_url', 'php_in_url']
for col in word_cloud_columns:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
fig.subplots_adjust(hspace=0.8, wspace=0.2)
counts = pd.crosstab(train[col], train['type'])
counts.plot(kind='bar', stacked=True, ax=ax[0, 0], color=[palette_dict[label] for label in counts.columns])
ax[0, 0].set_title(f'Bar Plot of URL Types Based on {col}', fontsize=fontsize)
ax[0, 0].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
ax[0, 0].set_ylabel('Count', fontsize=fontsize)
ax[0, 0].tick_params(axis='x', rotation=45, labelsize=labelsize)
ax[0, 0].tick_params(axis='y', labelsize=labelsize)
ax[0, 0].legend(unique_labels, loc='upper right')
type_counts = pd.crosstab(train[col], train['type'], normalize='index')
type_counts.plot(kind='bar', ax=ax[0, 1], color=[palette_dict[label] for label in type_counts.columns])
ax[0, 1].set_title(f'Proportional Distribution of URL Types Based on {col}', fontsize=fontsize)
ax[0, 1].set_ylabel('Proportion (%)', fontsize=fontsize)
ax[0, 1].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
ax[0, 1].tick_params(axis='x', rotation=45, labelsize=labelsize)
ax[0, 1].tick_params(axis='y', labelsize=labelsize)
ax[0, 1].legend(unique_labels, loc='upper right')
val_0 = train[train[col] == 0]['type'].value_counts()
val_1 = train[train[col] == 1]['type'].value_counts()
ax[1, 0].pie(val_0, labels=val_0.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_0.index])
ax[1, 0].set_title(f'URL Type Distribution ({col} = 0)', fontsize=fontsize, y=-0.1)
ax[1, 1].pie(val_1, labels=val_1.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_1.index])
ax[1, 1].set_title(f'URL Type Distribution ({col} = 1)', fontsize=fontsize, y=-0.1)
fig.suptitle(f'\nAnalysis of URL Types Distribution of: {col}', fontsize=18, y=1.02)
plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.show()
Word Cloud Feature Analysis Results:¶
- Malware Detection:
- The following features were critical in increasing the detection of Malware URLs:
- 'exe_in_url': Presence of 'exe' in the URL was a strong indicator of malware.
- 'mozi_in_url': URLs containing 'mozi' showed a significant correlation with malware.
- 'jp_in_url': 'jp' in the URL flagged a substantial number of malicious URLs.
- 'mitsui_in_url': The presence of 'mitsui' led to higher malware detection rates.
- 'mixh_in_url': URLs with 'mixh' contributed to the enhanced identification of malware.
- Phishing Detection:
- The following features were effective in improving the detection of Phishing URLs:
- 'ietf_in_url': The presence of 'ietf' significantly raised the detection of phishing attempts.
- 'tools_in_url': URLs with 'tools' had a strong association with phishing activity.
- Defacement Detection:
- The following features were highly influential in detecting URLs associated with Defacement:
- 'index_in_url': The presence of 'index' in the URL appeared far more frequently in Defacement URLs.
- 'com_content_in_url': This feature significantly increased the detection of Defacement URLs, as 'com_content' is typical of PHP-based sites.
- 'option_in_url': The presence of 'option' in the URL improved the identification of Defacement URLs.
- 'php_in_url': Naturally, 'php_in_url' was very effective in detecting Defacement URLs, which are often PHP-based.
Binary Features Analysis¶
unique_labels = sorted(set(train['type'].unique()).union(set(test['type'].unique())))
palette_dict = {label: color_mapping[label] for label in unique_labels}
for col in categorical_columns:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
fig.subplots_adjust(hspace=0.8, wspace=0.2)
counts = pd.crosstab(train[col], train['type'])
counts.plot(kind='bar', stacked=True, ax=ax[0, 0], color=[palette_dict[label] for label in counts.columns])
ax[0, 0].set_title(f'Bar Plot of URL Types Based on {col}', fontsize=fontsize)
ax[0, 0].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
ax[0, 0].set_ylabel('Count', fontsize=fontsize)
ax[0, 0].tick_params(axis='x', rotation=45, labelsize=labelsize)
ax[0, 0].tick_params(axis='y', labelsize=labelsize)
ax[0, 0].legend(unique_labels, loc='upper right')
type_counts = pd.crosstab(train[col], train['type'], normalize='index')
type_counts.plot(kind='bar', ax=ax[0, 1], color=[palette_dict[label] for label in type_counts.columns])
ax[0, 1].set_title(f'Proportional Distribution of URL Types Based on {col}', fontsize=fontsize)
ax[0, 1].set_ylabel('Proportion (%)', fontsize=fontsize)
ax[0, 1].set_xlabel(f'Use {col} (0 = No, 1 = Yes)', fontsize=fontsize)
ax[0, 1].tick_params(axis='x', rotation=45, labelsize=labelsize)
ax[0, 1].tick_params(axis='y', labelsize=labelsize)
ax[0, 1].legend(unique_labels, loc='upper right')
val_0 = train[train[col] == 0]['type'].value_counts()
val_1 = train[train[col] == 1]['type'].value_counts()
ax[1, 0].pie(val_0, labels=val_0.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_0.index])
ax[1, 0].set_title(f'URL Type Distribution ({col} = 0)', fontsize=fontsize, y=-0.1)
ax[1, 1].pie(val_1, labels=val_1.index, autopct='%1.1f%%', startangle=90,
textprops={'fontsize': labelsize}, colors=[palette_dict[label] for label in val_1.index])
ax[1, 1].set_title(f'URL Type Distribution ({col} = 1)', fontsize=fontsize, y=-0.1)
fig.suptitle(f'\nAnalysis of URL Types Distribution of: {col}', fontsize=18, y=1.02)
plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.show()
Additional Binary Feature Analysis Results:¶
- Malware Detection:
- 'having_ip_address = 1': URLs that contain an IP address strongly increase the likelihood of being classified as Malware URLs. However, there are relatively few such observations.
- 'has_port_number = 1': URLs with a port number are strongly associated with Malware URLs, but the number of such occurrences is low.
- 'is_suspicious_suffix = 1': URLs with suspicious suffixes (like uncommon or unusual domain extensions) show a significant increase in the likelihood of being detected as Malware URLs.
- Defacement Detection:
- 'is_abnormal_url = 1': Abnormal URLs are highly correlated with Defacement URLs, also increase the likelihood of detecting Phishing URLs, and slightly increase the likelihood of detecting Malware URLs.
- 'contains_non_ascii = 1': URLs containing non-ASCII characters are strongly linked to Defacement URLs, although these cases are rare.
- Phishing Detection:
- 'is_risky_tld = 1': URLs with risky top-level domains (TLDs) increase the likelihood of detecting both Phishing URLs and Defacement URLs, but there are very few observations.
- 'contains_suspicious_word = 1': The presence of suspicious words in the URL is strongly correlated with Phishing URLs.
- No Significant Change:
- 'has_shortening_service = 1': URLs using shortening services did not show a significant change in the distribution or detection rates for any specific type of malicious URL.
Numerical Features Analysis¶
n_rows = 14
n_cols = 2
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 36))
axes = axes.flatten()
for i, col in enumerate(numerical_columns):
ax = axes[i]
sns.violinplot(data=train, x='type', y=col, palette='Set2', ax=ax)
ax.set_title(f'Analysis of URL Types Distribution of: {col}', fontsize=10)
ax.set_xlabel('Type', fontsize=8)
ax.set_ylabel(col, fontsize=8)
ax.grid(True)
plt.tight_layout()
plt.show()
Analysis of 'count_@':¶
- Overall Summary: The feature 'count_@' shows negligible variation across different URL types. Almost all observations of this feature are zero, regardless of whether the URL is benign, phishing, defacement, or malware.
- Conclusion: Since this feature is almost always zero across all URL types, it provides little to no useful information for distinguishing between them. Therefore, 'count_@' can be dropped from further analysis without impacting the model's performance.
Analysis of 'number_of_embedded':¶
- Overall Summary: The feature 'number_of_embedded' also exhibits very little variation. Nearly all URLs, across all types (benign, phishing, defacement, malware), have zero embedded elements, so there is hardly any differentiation between URL types based on this feature.
- Conclusion: Given the lack of variation and low discriminatory power of 'number_of_embedded', it can be safely removed from the dataset. Retaining it is unlikely to contribute meaningful insights or improve model accuracy.
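The near-zero-variance judgement above can also be automated rather than made visually. A sketch that flags columns whose most frequent value covers almost every row (the 0.999 cutoff is an illustrative assumption):

```python
import pandas as pd

# Sketch: flag near-constant columns, mirroring the manual judgement
# that count_@ and number_of_embedded carry almost no information.
# The 0.999 dominant-value cutoff is an illustrative choice.
def near_constant_columns(df: pd.DataFrame, threshold: float = 0.999) -> list:
    flagged = []
    for col in df.columns:
        # fraction of rows taken by the single most frequent value
        top_ratio = df[col].value_counts(normalize=True).iloc[0]
        if top_ratio >= threshold:
            flagged.append(col)
    return flagged

toy = pd.DataFrame({'count_@': [0] * 999 + [1],
                    'url_length': range(1000)})
print(near_constant_columns(toy))
```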
Analysis of 'digit_char_ratio', 'alpha_char_ratio', 'url_length', 'count_alpha', 'count_digits', 'special_char_ratio', and 'sum_special_chars':¶
- Overall Summary: Each of these features — 'digit_char_ratio', 'alpha_char_ratio', 'url_length', 'count_alpha', 'count_digits', 'special_char_ratio', and 'sum_special_chars' — displays useful patterns across different URL types. They provide critical insights into the structure and composition of URLs, which helps in distinguishing between benign, phishing, defacement, and malware URLs.
- Feature Interdependence:
- 'url_length', 'count_alpha', and 'count_digits' are fundamental metrics capturing the length of the URL and the counts of alphabetic and numeric characters. Malicious URLs may use unusual character distributions, which makes these features particularly informative.
- 'digit_char_ratio' and 'alpha_char_ratio' are derived from the raw counts, capturing the proportion of numeric and alphabetic characters in the URL. These ratios complement the absolute counts, giving a more nuanced view of how certain character types dominate different URL types.
- 'special_char_ratio' and 'sum_special_chars' focus specifically on special characters (punctuation and symbols). Since special characters often indicate deliberate URL obfuscation, especially in phishing or malware URLs, both the raw count and the length-normalized ratio provide valuable signals.
- Conclusion: Both the raw counts ('url_length', 'count_alpha', 'count_digits', 'sum_special_chars') and their corresponding ratios ('digit_char_ratio', 'alpha_char_ratio', 'special_char_ratio') are highly informative. They provide complementary views of URL composition and contribute significantly to identifying potentially malicious URLs. Therefore, all of these features should be retained, as they offer substantial predictive power.
Analysis of repeated_char_ratio:¶
Upon analyzing the distribution of 'count_repeated_char', it appears to be a useful feature for distinguishing between URL types. This makes sense, as repeated characters can often indicate obfuscation in malicious URLs.
To enhance this further, we will create a new feature, 'repeated_char_ratio', by dividing 'count_repeated_char' by 'url_length'. This normalizes the count of repeated characters by URL length, allowing us to capture meaningful patterns across URLs of different lengths.
Conclusion for Remaining Features:¶
For the remaining features such as 'longest_digit_sequence', 'count_https', 'count_http', 'count_dot', 'count_%', 'count_?', 'count_-', 'count_=', 'count_/', 'count_//', 'count_parameters', 'count_subdomain', 'number_of_directories', 'domain_length', 'path_length', 'tld_length', and 'first_directory_length', we observe that some show slight differences in distributions across URL types.
While these features don't exhibit as pronounced separation as others, they are still logically relevant to the problem of identifying different URL types. For instance:
- The counts of specific characters ('count_https', 'count_dot', 'count_/', 'count_parameters') are important since they can hint at URL structure differences.
- Features like 'longest_digit_sequence', 'number_of_directories', 'domain_length', and 'path_length' may help distinguish more complex URLs, particularly in malicious cases like phishing and malware.
Although the separation is subtle, these features likely contribute useful information when combined with others, especially in more advanced models. Therefore, we will retain these features for further analysis and potential model input.
# Drop Features ('count_@', 'number_of_embedded') with very little variance and therefore, provide minimal information.
train.drop(['count_@', 'number_of_embedded'], axis=1, inplace=True)
test.drop(['count_@', 'number_of_embedded'], axis=1, inplace=True)
numerical_columns.remove('count_@')
numerical_columns.remove('number_of_embedded')
# Create new features (repeated_char_ratio):
train['repeated_char_ratio'] = train['count_repeated_char'] / train['url_length']
test['repeated_char_ratio'] = test['count_repeated_char'] / test['url_length']
numerical_columns.append('repeated_char_ratio')
Feature Selection + Statistical Test¶
What is Feature Selection?¶
Feature selection is the process of identifying and selecting the most relevant features (or variables) for use in machine learning models. By choosing the most important features, we aim to reduce the dimensionality of the data, which can lead to better model performance, lower computational cost, and improved interpretability.
Purpose of Feature Selection¶
- Improves Model Performance: Irrelevant or redundant features can introduce noise and decrease the performance of a model. Selecting the right set of features helps in building models that generalize better.
- Reduces Overfitting: By removing irrelevant features, the model is less likely to memorize the training data, which reduces the risk of overfitting and improves performance on unseen data.
- Increases Model Interpretability: Fewer features make models simpler to understand and interpret, especially in real-world applications.
- Decreases Computational Cost: Working with fewer features reduces training time and makes the model more efficient, which is especially important when working with large datasets.
Workflow¶
Step 1: Statistical Test Using Linear Methods¶
We start by applying linear feature selection methods to filter out irrelevant features. The two methods used are:
- ANOVA F-Test (f_classif): This test is applied to numerical features to check if they have a linear relationship with the target variable.
- Chi-Square Test: This is used for binary/categorical features to assess the relevance of each feature to the target.
These tests provide an initial filter to remove features that have weak linear associations with the target.
Step 2: Feature Selection Using BorutaPy¶
After the initial filtering, we apply BorutaPy, which is an all-relevant feature selection method based on a Random Forest model. BorutaPy iteratively tests whether each feature is truly important by comparing its importance with randomized (shadow) features.
Boruta helps identify both strong and weak relevant features, including those with complex interactions, which might not be captured by linear tests like ANOVA or Chi-Square.
Step 3: Further Feature Selection Using Mutual Information¶
Once BorutaPy has selected the best features, we refine the selection using Mutual Information (MI). MI measures the dependency between each feature and the target, capturing both linear and non-linear relationships.
We select the top features based on their mutual information scores, typically choosing the same number of features as selected by BorutaPy (len(boruta_selected_features)).
Step 4: Model Building and Comparison¶
In this step, we train models using different sets of features and compare their performance. Specifically, we build and evaluate:
- Model 1: Using the full dataset with all features.
- Model 2: Using features selected by BorutaPy.
- Model 3: Using features selected by Mutual Information (refined from Boruta-selected features).
We then compare the performance of these models using relevant evaluation metrics (accuracy, precision, recall, F1-score, AUC) to assess which feature selection method or dataset produces the best results.
f_classif¶
What is f_classif?¶
f_classif is a statistical test that helps evaluate the relationship between each feature and the target variable in classification tasks. It is based on the ANOVA (Analysis of Variance) test, which measures how much variance in the target variable is explained by each feature.
How Does f_classif Work?¶
Input:
- Takes features (X) and a categorical target (y).
ANOVA F-value:
- Computes an F-statistic for each feature, which shows the ratio of variance between the groups (classes) to the variance within each group.
Feature Importance:
- A higher F-value indicates that the feature provides more information to predict the target.
Output:
- Returns an array of F-values and their corresponding p-values, where smaller p-values suggest that the feature is statistically significant.
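As a rough illustration of what f_classif computes under the hood, here is the one-way ANOVA F-statistic for a single feature split by class. This is a minimal plain-Python sketch (the actual scikit-learn implementation is vectorized and also returns p-values):

```python
def anova_f(groups):
    """One-way ANOVA F-statistic: between-group variance / within-group variance."""
    n = sum(len(g) for g in groups)           # total number of samples
    k = len(groups)                           # number of classes
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: variance explained by class membership
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: residual variance inside each class
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Well-separated class means -> large F (the feature is informative)
print(anova_f([[1, 2, 3], [7, 8, 9]]))   # 54.0
```

A larger F-value means the class means differ more than expected given the within-class spread, which is exactly why high-F features are good candidates for prediction.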
univariate = f_classif(train[numerical_columns], train['type'])
univariate = pd.Series(univariate[1])
univariate.index = numerical_columns
ax = univariate.sort_values(ascending=True).plot.bar(figsize=(10, 6))
ax.set_title('Univariate Feature Importance Using ANOVA F-test', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('p-values', fontsize=14)
plt.show()
ANOVA F-test (f_classif) Results:¶
The F-test results reveal that all features have p-values equal to 0. This suggests that there is a significant difference in the mean values of the features across the target variable categories. Therefore, these features are highly relevant and important for predictive modeling.
Chi-Square Test¶
The Chi-Square Test evaluates the association between categorical variables.
Key Points:¶
- Types:
- Independence Test: Checks relationships between two variables.
- Goodness of Fit Test: Compares observed distributions to expected ones.
Process:¶
- Set null and alternative hypotheses.
- Create a contingency table.
- Calculate expected frequencies.
- Compute the Chi-Square statistic and p-value.
- Reject the null hypothesis if p < significance level (0.05).
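The steps above can be sketched for a small contingency table. This is a minimal plain-Python illustration of the statistic itself; the notebook relies on scipy.stats.chi2_contingency, which additionally computes the p-value:

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the independence null hypothesis
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Observed counts: feature value (columns) cross-tabulated against class (rows)
stat, df = chi_square_statistic([[10, 20], [20, 10]])
print(round(stat, 4), df)   # 6.6667 1
```

The statistic grows as observed counts deviate from the counts expected under independence, which is what makes a low p-value evidence of an association between feature and target.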
chi_ls = []
for feature in categorical_columns + word_cloud_columns:
c = pd.crosstab(train['type'], train[feature])
p_value = stats.chi2_contingency(c)[1]
chi_ls.append(p_value)
chi_test = pd.Series(chi_ls, index=categorical_columns + word_cloud_columns).sort_values(ascending=True)
chi_test.plot.bar(rot=45)
plt.ylabel('p value')
plt.title('Feature importance based on chi-square test')
plt.show()
Chi-Square Test Results:¶
The Chi-square test results show that all features have extremely low p-values, either 0 or close to 0 (e.g., 1.59e-147). This indicates a strong association between each feature and the target variable. These features are likely valuable predictors and should be considered for further analysis in the model.
Overall Summary:¶
From the feature selection tests conducted, both categorical and numerical features show strong influence on the target variable.
The Chi-square test revealed extremely low p-values for all categorical features, indicating a significant association with the target variable. This suggests that these features are highly informative and relevant for modeling.
The ANOVA F-test (f_classif) also showed p-values of 0 for all numerical features, meaning there is a significant difference in the means of these features across the different target categories. These numerical features are equally important for the model.
In conclusion, both categorical and numerical features demonstrate strong statistical significance and influence on the target, making them valuable predictors for further modeling.
Encoding Target (type) AND Split to X & y¶
encoding_map = {"benign": 0, "phishing": 1, "defacement": 2, "malware": 3}
train['type'] = train['type'].apply(lambda x: encoding_map[x])
test['type'] = test['type'].apply(lambda x: encoding_map[x])
X_train, y_train = train.drop('type', axis=1), train['type']
X_test, y_test = test.drop('type', axis=1), test['type']
BorutaPy¶
What is BorutaPy?¶
BorutaPy is a feature selection method specifically designed for use with tree-based models, such as Random Forest. It is an implementation of the Boruta algorithm, which is a wrapper method that iteratively selects the most important features while considering their importance in relation to shuffled, random features (called "shadow features").
The Boruta algorithm is designed to identify all relevant features, rather than just the minimal set required for good model performance. This makes it particularly useful in complex datasets where interactions between features are important.
How Does BorutaPy Work?¶
Create Shadow Features:
- Boruta duplicates the original features and shuffles them to create shadow features. These shadow features serve as a baseline for feature importance comparison.
Train Random Forest:
- A Random Forest model is trained on the dataset including both the original and shadow features.
Feature Importance Comparison:
- The algorithm computes feature importance scores for both original and shadow features.
- Features that have higher importance than the most important shadow feature are considered "important."
- Features with lower importance are considered "unimportant."
Iterative Process:
- The process is repeated for a set number of iterations or until all features are determined to be either important or unimportant.
- Features that remain undetermined after several iterations are categorized as "tentative."
Final Selection:
- At the end of the iterations, Boruta selects features that consistently show higher importance than shadow features, ensuring that no potentially relevant feature is discarded too early.
%%time
rf = RandomForestClassifier(n_estimators=20, random_state=RANDOM_STATE, n_jobs=4)
boruta_selector = BorutaPy(rf, n_estimators=4, random_state=RANDOM_STATE)
boruta_selector.fit(X_train.values, y_train.values)
boruta_selected_features_mask = boruta_selector.support_
boruta_selected_features = X_train.columns[boruta_selected_features_mask]
selected_feature_indices = np.where(boruta_selected_features_mask)[0]
selected_feature_importances = boruta_selector.estimator.feature_importances_[selected_feature_indices]
feature_importance_df = pd.DataFrame({
'Feature': boruta_selected_features,
'Importance': selected_feature_importances
})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df, palette='viridis')
plt.title('Feature Importances from BorutaPy')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
print()
plot_corr_matrix(X_train[boruta_selected_features])
CPU times: total: 19min 12s Wall time: 6min 12s
Mutual Information for Classification - mutual_info_classif¶
What is Mutual Information?¶
Mutual information measures the amount of information gained about one variable by knowing another. In this context, it quantifies how much knowing a particular feature reduces uncertainty about the target class.
How Does mutual_info_classif Work?¶
Estimates Dependency:
- Mutual information measures the dependency between a feature and the target variable. A high score means that the feature provides a lot of information about the target (it helps reduce uncertainty about the target).
Captures Non-linear Relationships:
- Unlike methods like correlation that only capture linear relationships, mutual information can detect both linear and non-linear relationships between features and the target.
Always Non-negative:
- Mutual information is always non-negative. A score of 0 means the feature and the target are independent (the feature provides no useful information about the target); higher scores indicate stronger dependency.
Handling Continuous Data:
- For continuous features, mutual_info_classif uses a k-nearest-neighbors-based estimator, allowing it to work effectively with both categorical and continuous data.
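To make the definition concrete, mutual information for two discrete variables can be computed directly from their joint distribution. This is a plain-Python sketch; mutual_info_classif itself uses a nearest-neighbor estimator for continuous features:

```python
import math
from collections import Counter

def mutual_information(x, y):
    """I(X;Y) = sum over (x,y) of p(x,y) * log(p(x,y) / (p(x) * p(y))), in nats."""
    n = len(x)
    joint = Counter(zip(x, y))                 # joint counts of (x, y) pairs
    marg_x, marg_y = Counter(x), Counter(y)    # marginal counts
    return sum((c / n) * math.log((c / n) / ((marg_x[a] / n) * (marg_y[b] / n)))
               for (a, b), c in joint.items())

# Perfectly dependent: knowing x determines y -> MI = ln(2) ≈ 0.693
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))
# Independent: x carries no information about y -> MI = 0
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the score depends only on the joint distribution, it captures any form of dependency, not just linear correlation.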
%%time
selector = SelectKBest(score_func=mutual_info_classif, k=len(boruta_selected_features))
selector.fit(X_train, y_train)
selected_features_mask = selector.get_support()
mi_selected_features = X_train.columns[selected_features_mask]
feature_scores = selector.scores_
selected_features_scores = pd.DataFrame({
'Feature': mi_selected_features,
'Score': feature_scores[selected_features_mask]
})
selected_features_scores = selected_features_scores.sort_values(by='Score', ascending=True)
plt.figure(figsize=(10, 6))
sns.barplot(x='Score', y='Feature', data=selected_features_scores, palette='viridis')
plt.title('Top Selected Features Based on Mutual Information Scores')
plt.xlabel('Mutual Information Score')
plt.ylabel('Selected Features')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print()
plot_corr_matrix(X_train[mi_selected_features])
CPU times: total: 3min 10s Wall time: 3min 12s
boruta_feature_set = set(boruta_selected_features)
mi_feature_set = set(mi_selected_features)
unique_to_boruta = boruta_feature_set - mi_feature_set
unique_to_mi = mi_feature_set - boruta_feature_set
common_features = boruta_feature_set.intersection(mi_feature_set)
print("Common Features:", common_features)
print("\nUnique to Boruta:", unique_to_boruta)
print("\nUnique to MI:", unique_to_mi)
Common Features: {'index_in_url', 'number_of_directories', 'special_char_ratio', 'path_length', 'sum_special_chars', 'count_digits', 'digit_char_ratio', 'count_http', 'tld_length', 'count_//', 'repeated_char_ratio', 'count_subdomain', 'count_dot', 'count_alpha', 'count_repeated_char', 'alpha_char_ratio', 'is_abnormal_url', 'url_length', 'has_subdomain', 'longest_digit_sequence', 'count_/', 'first_directory_length'}
Unique to Boruta: {'exe_in_url', 'count_-', 'count_https', 'domain_length'}
Unique to MI: {'count_parameters', 'count_=', 'php_in_url', 'option_in_url'}
Building Models¶
We will use two kinds of models:¶
- Traditional Models (tree-based models) + Optuna:
  - XGBoost
  - LightGBM
  - CatBoost
Why use them?
- These models deliver high accuracy and efficiency on a wide range of predictive tasks.
- They are known for their scalability and are quite fast, allowing rapid training even on large datasets.
- Their ability to manage imbalanced data and outliers makes them highly flexible for real-world applications.
- They do not require feature scaling, making preprocessing simpler compared to models like logistic regression or SVM.
- They can handle correlated features effectively due to the way tree-based models work, reducing the impact of multicollinearity on performance.
- Deep Learning Models:
  - FNN
  - FNN + BERT
Hyperparameter Optimization¶
What is Hyperparameter Optimization?¶
Hyperparameter optimization (HPO) is the process of selecting the best combination of hyperparameters for a machine learning model. Hyperparameters are the configuration settings that are not learned from the data during the training process, but instead set before training begins. Examples of hyperparameters include:
- Learning rate
- Regularization parameters
- Tree depth in tree-based models
HPO aims to improve model performance by tuning these hyperparameters to achieve the best possible predictive accuracy.
Why is Hyperparameter Optimization Important?¶
- Model Performance: The right set of hyperparameters can significantly enhance the model's accuracy, robustness, and generalization to unseen data.
- Avoiding Overfitting: Proper tuning helps to prevent overfitting, where a model performs well on training data but poorly on new data.
- Resource Efficiency: HPO helps to utilize computational resources effectively by identifying the best configurations quickly, avoiding unnecessary computations.
- Automating the Process: Automated HPO methods can save time and reduce human bias in the selection of hyperparameters.
What is Optuna?¶
Optuna is an open-source hyperparameter optimization framework that automates the hyperparameter tuning process. It is designed for flexibility and ease of use, allowing users to define optimization objectives and manage trials efficiently. Optuna employs advanced optimization algorithms to explore the hyperparameter space intelligently.
Why Use Optuna?¶
- Efficiency: Optuna uses sophisticated algorithms (like Tree-structured Parzen Estimator) to find optimal hyperparameters quickly, reducing the number of trials needed.
- Automatic Pruning: The pruning feature allows for real-time stopping of trials that are unlikely to perform well, further enhancing efficiency.
- Ease of Use: With its simple API, users can define complex optimization problems with minimal effort.
- Visualization: Optuna provides visualization tools to help analyze the optimization process, making it easier to understand hyperparameter effects.
- Community and Support: As an open-source tool, Optuna has strong community support and regular updates, ensuring it stays relevant and powerful.
In summary, hyperparameter optimization is crucial for improving machine learning model performance, and Optuna is a robust tool that facilitates efficient and effective HPO.
We will use Optuna to perform hyperparameter tuning for each model, optimizing for the best performance. Each trial is evaluated with 3-fold cross-validation via the libraries' native cv routines (xgb.cv, lgb.cv, and cat.cv), which allows us to assess model performance robustly across different subsets of the data.
Traditional Models¶
def objective(trial: optuna.Trial, X_train, y_train, model):
classifier = type(model).__name__
sample_weight = compute_sample_weight('balanced', y_train)
if classifier == "XGBClassifier":
dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weight)
eval_metric = trial.suggest_categorical("eval_metric", ['auc', 'mlogloss'])
num_boost_round = trial.suggest_int("n_estimators", 60, 100)
params = {
'max_depth': trial.suggest_int("max_depth", 5, 9),
'learning_rate': trial.suggest_float("learning_rate", 0.1, 0.3),
'subsample': trial.suggest_float("subsample", 0.7, 0.9),
'colsample_bynode': trial.suggest_float("colsample_bynode", 0.7, 0.9),
'colsample_bylevel': trial.suggest_float("colsample_bylevel", 0.7, 0.9),
'colsample_bytree': trial.suggest_float("colsample_bytree", 0.7, 0.9),
'reg_lambda': trial.suggest_float("reg_lambda", 1e-1, 1.0),
'reg_alpha': trial.suggest_float("reg_alpha", 1e-1, 1.0),
'min_child_weight': trial.suggest_float("min_child_weight", 0.5, 3.0),
'max_delta_step': trial.suggest_float("max_delta_step", 0.5, 3.0),
'booster': trial.suggest_categorical("booster", ['dart']),
'objective': trial.suggest_categorical("objective", ['multi:softmax']),
'num_class': trial.suggest_categorical("num_class", [4]),
'random_state': trial.suggest_categorical("random_state", [ RANDOM_STATE ]),
'nthread': trial.suggest_categorical("nthread", [ 4 ]),
'n_jobs': trial.suggest_categorical("n_jobs", [ 1 ]),
'eval_metric': eval_metric,
}
cv_results = xgb.cv(
params,
dtrain,
num_boost_round= num_boost_round,
metrics=[eval_metric],
nfold=3,
as_pandas=True,
shuffle=False
)
return cv_results['test-' + eval_metric + '-mean'].mean()
elif classifier == "LGBMClassifier":
dtrain = lgb.Dataset(X_train, label=y_train, weight=sample_weight)
params = {
'objective': trial.suggest_categorical("objective", [ 'multiclass' ]),
'metric': trial.suggest_categorical("eval_metric", ['multi_logloss']),
'num_class': trial.suggest_categorical("num_class", [ 4 ]),
'max_depth': trial.suggest_int("max_depth", 5, 9),
'num_leaves': trial.suggest_int("num_leaves", 31, 96),
'learning_rate': trial.suggest_float("learning_rate", 0.1, 0.3),
'boosting_type': trial.suggest_categorical("boosting_type", ['dart']),
'min_child_samples': trial.suggest_int("min_child_samples", 40, 100),
'min_child_weight': trial.suggest_float("min_child_weight", 0.01, 1.0, log=True),
'subsample': trial.suggest_float("subsample", 0.7, 0.9),
'colsample_bytree': trial.suggest_float("colsample_bytree", 0.7, 0.9),
'reg_lambda': trial.suggest_float("reg_lambda", 1e-1, 1.0),
'reg_alpha': trial.suggest_float("reg_alpha", 1e-1, 1.0),
'seed': trial.suggest_categorical("seed", [ RANDOM_STATE ]),
'n_jobs': trial.suggest_categorical("n_jobs", [ 1 ]),
'force_row_wise': True,
'verbosity': -1
}
num_boost_round = trial.suggest_int('num_boost_round', 60, 100)
cv_results = lgb.cv(
params,
dtrain,
nfold=3,
num_boost_round=num_boost_round,
stratified=True,
)
return cv_results['valid multi_logloss-mean'][-1]
elif classifier == "CatBoostClassifier":
params = {
'iterations': trial.suggest_int("iterations", 60, 100),
'depth': trial.suggest_int("depth", 5, 9),
'learning_rate': trial.suggest_float("learning_rate", 0.1, 0.3, log=True),
'l2_leaf_reg': trial.suggest_float("l2_leaf_reg", 1e-1, 10.0),
'bootstrap_type': trial.suggest_categorical("bootstrap_type", ["Bayesian", "Bernoulli"]),
'random_strength': trial.suggest_float("random_strength", 0.1, 10.0),
'border_count': trial.suggest_int("border_count", 1, 127),
'random_seed': trial.suggest_categorical("random_seed", [RANDOM_STATE]),
'loss_function': 'MultiClass',
'logging_level': 'Silent'
}
if params['bootstrap_type'] == 'Bayesian':
params['bagging_temperature'] = trial.suggest_float("bagging_temperature", 0.0, 1.0)
cv_results = cat.cv(
cat.Pool(X_train, y_train, weight=sample_weight),
params=params,
nfold=3,
)
return cv_results['test-MultiClass-mean'].mean()
Handling Imbalanced Data in Machine Learning¶
What is Imbalanced Data?¶
Imbalanced data refers to a situation in machine learning where the distribution of classes is not uniform. In a multi-class classification problem, one or more classes can have significantly fewer samples compared to others. For instance, in the given case, the distribution is as follows:
- Label 0: 67.4% (benign)
- Label 1: 14.9% (phishing)
- Label 2: 13.8% (defacement)
- Label 3: 3.9% (malware)
In this scenario, label 3 is heavily underrepresented, which can lead to biased predictions and inadequate learning for that class.
Why is Imbalanced Data a Problem?¶
Imbalanced data can lead to several issues in model training and evaluation:
- Biased Predictions: Models trained on imbalanced data tend to favor the majority class, leading to poor performance on the minority class.
- Misleading Metrics: Accuracy can be a misleading metric; a model that predicts the majority class well may still perform poorly overall if it fails to identify the minority class.
- Poor Generalization: The model may not learn the underlying patterns of the minority class, resulting in a lack of generalization and high error rates when making predictions on unseen data.
What is compute_sample_weight?¶
compute_sample_weight is a utility function from sklearn.utils that helps assign weights to samples in a dataset based on their class distribution. It generates a weight for each instance to give more importance to underrepresented classes.
Why Use compute_sample_weight with CatBoost, LightGBM, and XGBoost?¶
- Tree-Based Models: Algorithms like CatBoost, LightGBM, and XGBoost can benefit from sample weights because they allow the model to adjust the learning process according to the importance of each sample.
- Handling Imbalance: By applying sample weights, these models can better learn the minority class during training, thus improving classification performance across all classes.
- Flexible Weighting: This approach allows you to assign different weights based on additional context or business logic, further enhancing model performance on critical instances.
In conclusion, addressing imbalanced data through methods like compute_sample_weight is crucial for improving the robustness and accuracy of machine learning models, particularly in tree-based frameworks.
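For the 'balanced' mode, scikit-learn assigns each sample the weight n_samples / (n_classes * count(class)). A quick sketch of that arithmetic in plain Python (the label counts here are hypothetical, not the project's actual class distribution):

```python
from collections import Counter

def balanced_sample_weights(y):
    """Replicate the 'balanced' scheme: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return [n / (k * counts[label]) for label in y]

# 6 majority-class samples vs. 2 minority-class samples
y = [0, 0, 0, 0, 0, 0, 1, 1]
w = balanced_sample_weights(y)
print(w[0], w[-1])   # majority weight ~0.667, minority weight 2.0
```

Each minority-class sample ends up counting three times as much as a majority-class sample during training, so the total weight per class is equalized.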
%%time
models = {
"XGB" : {"model": xgb.XGBClassifier},
"LGBM": {"model": lgb.LGBMClassifier},
"CAT" : {"model": cat.CatBoostClassifier},
}
feature_selection_techniques = {"all_features": X_train.columns, "Boruta_features": boruta_selected_features, "MI_features": mi_selected_features}
trials = 5
feature_set_result = {}
best_models = {}
for name, model_dict in models.items():
for key, feature_list in feature_selection_techniques.items():
print(f"Running {name} with - {key}")
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE))
start_time = time.time()
study.optimize(lambda trial: objective(trial, X_train[feature_list], y_train, model_dict["model"]()), n_trials=trials, n_jobs=1)
end_time = time.time()
optuna_time_elapsed = end_time - start_time
best_params = study.best_trial.params
best_model = model_dict["model"](**best_params)
sample_weight = compute_sample_weight('balanced', y_train)
start_time = time.time()
best_model.fit(X_train[feature_list], y_train, sample_weight=sample_weight)
end_time = time.time()
time_elapsed = end_time - start_time
y_pred = best_model.predict(X_test[feature_list])
accuracy = accuracy_score( y_test, y_pred )
precision = precision_score( y_test, y_pred, average='macro' )
recall = recall_score( y_test, y_pred, average='macro' )
f1 = f1_score( y_test, y_pred, average='macro' )
if name not in best_models:
best_models[name] = {}
best_models[name][key] = best_model
if key not in feature_set_result:
feature_set_result[key] = {}
feature_set_result[key][name] = {
'Accuracy': accuracy,
'Precision': precision,
'Recall': recall,
'F1-Score': f1,
'TimeElapsed': time_elapsed,
'Optuna_TimeElapsed': optuna_time_elapsed
}
print("\n----------------------------------------------------\n")
[I 2024-10-23 09:31:47,686] A new study created in memory with name: no-name-6848f628-0dbc-4950-b8df-cb664c6edf43
Running XGB with - all_features
[I 2024-10-23 09:53:53,590] Trial 0 finished with value: 0.3726073402698043 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 67, 'max_depth': 5, 'learning_rate': 0.14100379047885775, 'subsample': 0.7212125748952531, 'colsample_bynode': 0.845448028736891, 'colsample_bylevel': 0.8358801047050284, 'colsample_bytree': 0.7947691406816437, 'reg_lambda': 0.5034662420322742, 'reg_alpha': 0.11719625308521943, 'min_child_weight': 2.3814958430214483, 'max_delta_step': 2.0061213475203163, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 0 with value: 0.3726073402698043.
[I 2024-10-23 10:32:21,768] Trial 1 finished with value: 0.9893102220774443 and parameters: {'eval_metric': 'auc', 'n_estimators': 84, 'max_depth': 7, 'learning_rate': 0.1450708326385391, 'subsample': 0.8340348593785392, 'colsample_bynode': 0.8471533184903827, 'colsample_bylevel': 0.7515991276156387, 'colsample_bytree': 0.7191084307720731, 'reg_lambda': 0.9648187680130099, 'reg_alpha': 0.32659055809120996, 'min_child_weight': 1.205412798609108, 'max_delta_step': 2.420634836656963, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 10:57:57,064] Trial 2 finished with value: 0.9871548896245902 and parameters: {'eval_metric': 'auc', 'n_estimators': 75, 'max_depth': 6, 'learning_rate': 0.1571654776954366, 'subsample': 0.8480536306281314, 'colsample_bynode': 0.7477973664871458, 'colsample_bylevel': 0.7875443409299727, 'colsample_bytree': 0.8767077405553172, 'reg_lambda': 0.3603530262994459, 'reg_alpha': 0.8060561713786789, 'min_child_weight': 2.397384141933976, 'max_delta_step': 1.5444634623360838, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 11:15:55,937] Trial 3 finished with value: 0.2532936644764878 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 62, 'max_depth': 7, 'learning_rate': 0.26746474446216606, 'subsample': 0.8784972772658111, 'colsample_bynode': 0.7401054887766454, 'colsample_bylevel': 0.8004790468730579, 'colsample_bytree': 0.879076368892255, 'reg_lambda': 0.3303288382494436, 'reg_alpha': 0.8805091086696613, 'min_child_weight': 0.5412198370415773, 'max_delta_step': 1.8812423859276484, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 11:37:19,235] Trial 4 finished with value: 0.2803199337494971 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.28042094091723957, 'subsample': 0.8748079600748659, 'colsample_bynode': 0.7327334581105734, 'colsample_bylevel': 0.8999482613372792, 'colsample_bytree': 0.7693607940693447, 'reg_lambda': 0.3815903434097426, 'reg_alpha': 0.8623936188913988, 'min_child_weight': 2.7005777565812403, 'max_delta_step': 2.191396628866615, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9893102220774443.
[I 2024-10-23 12:30:39,896] A new study created in memory with name: no-name-92373be5-dac7-400c-a0ae-556de54309da
Running XGB with - Boruta_features
[I 2024-10-23 12:43:59,433] Trial 0 finished with value: 0.38226456908590534 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 67, 'max_depth': 5, 'learning_rate': 0.14100379047885775, 'subsample': 0.7212125748952531, 'colsample_bynode': 0.845448028736891, 'colsample_bylevel': 0.8358801047050284, 'colsample_bytree': 0.7947691406816437, 'reg_lambda': 0.5034662420322742, 'reg_alpha': 0.11719625308521943, 'min_child_weight': 2.3814958430214483, 'max_delta_step': 2.0061213475203163, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 0 with value: 0.38226456908590534.
[I 2024-10-23 13:06:48,202] Trial 1 finished with value: 0.9881188472914543 and parameters: {'eval_metric': 'auc', 'n_estimators': 84, 'max_depth': 7, 'learning_rate': 0.1450708326385391, 'subsample': 0.8340348593785392, 'colsample_bynode': 0.8471533184903827, 'colsample_bylevel': 0.7515991276156387, 'colsample_bytree': 0.7191084307720731, 'reg_lambda': 0.9648187680130099, 'reg_alpha': 0.32659055809120996, 'min_child_weight': 1.205412798609108, 'max_delta_step': 2.420634836656963, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 13:25:28,732] Trial 2 finished with value: 0.9858997093684756 and parameters: {'eval_metric': 'auc', 'n_estimators': 75, 'max_depth': 6, 'learning_rate': 0.1571654776954366, 'subsample': 0.8480536306281314, 'colsample_bynode': 0.7477973664871458, 'colsample_bylevel': 0.7875443409299727, 'colsample_bytree': 0.8767077405553172, 'reg_lambda': 0.3603530262994459, 'reg_alpha': 0.8060561713786789, 'min_child_weight': 2.397384141933976, 'max_delta_step': 1.5444634623360838, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 13:38:04,560] Trial 3 finished with value: 0.2650465790535372 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 62, 'max_depth': 7, 'learning_rate': 0.26746474446216606, 'subsample': 0.8784972772658111, 'colsample_bynode': 0.7401054887766454, 'colsample_bylevel': 0.8004790468730579, 'colsample_bytree': 0.879076368892255, 'reg_lambda': 0.3303288382494436, 'reg_alpha': 0.8805091086696613, 'min_child_weight': 0.5412198370415773, 'max_delta_step': 1.8812423859276484, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 13:52:30,938] Trial 4 finished with value: 0.2909791570453547 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.28042094091723957, 'subsample': 0.8748079600748659, 'colsample_bynode': 0.7327334581105734, 'colsample_bylevel': 0.8999482613372792, 'colsample_bytree': 0.7693607940693447, 'reg_lambda': 0.3815903434097426, 'reg_alpha': 0.8623936188913988, 'min_child_weight': 2.7005777565812403, 'max_delta_step': 2.191396628866615, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.9881188472914543.
[I 2024-10-23 14:22:10,733] A new study created in memory with name: no-name-6fe0953d-44e6-41a1-9acf-8732974a3815
Running XGB with - MI_features
[I 2024-10-23 14:35:30,272] Trial 0 finished with value: 0.4094491047567859 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 67, 'max_depth': 5, 'learning_rate': 0.14100379047885775, 'subsample': 0.7212125748952531, 'colsample_bynode': 0.845448028736891, 'colsample_bylevel': 0.8358801047050284, 'colsample_bytree': 0.7947691406816437, 'reg_lambda': 0.5034662420322742, 'reg_alpha': 0.11719625308521943, 'min_child_weight': 2.3814958430214483, 'max_delta_step': 2.0061213475203163, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 0 with value: 0.4094491047567859.
[I 2024-10-23 15:00:18,849] Trial 1 finished with value: 0.98652911126521 and parameters: {'eval_metric': 'auc', 'n_estimators': 84, 'max_depth': 7, 'learning_rate': 0.1450708326385391, 'subsample': 0.8340348593785392, 'colsample_bynode': 0.8471533184903827, 'colsample_bylevel': 0.7515991276156387, 'colsample_bytree': 0.7191084307720731, 'reg_lambda': 0.9648187680130099, 'reg_alpha': 0.32659055809120996, 'min_child_weight': 1.205412798609108, 'max_delta_step': 2.420634836656963, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 15:19:20,253] Trial 2 finished with value: 0.9840133866564393 and parameters: {'eval_metric': 'auc', 'n_estimators': 75, 'max_depth': 6, 'learning_rate': 0.1571654776954366, 'subsample': 0.8480536306281314, 'colsample_bynode': 0.7477973664871458, 'colsample_bylevel': 0.7875443409299727, 'colsample_bytree': 0.8767077405553172, 'reg_lambda': 0.3603530262994459, 'reg_alpha': 0.8060561713786789, 'min_child_weight': 2.397384141933976, 'max_delta_step': 1.5444634623360838, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 15:31:48,659] Trial 3 finished with value: 0.28565134511602036 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 62, 'max_depth': 7, 'learning_rate': 0.26746474446216606, 'subsample': 0.8784972772658111, 'colsample_bynode': 0.7401054887766454, 'colsample_bylevel': 0.8004790468730579, 'colsample_bytree': 0.879076368892255, 'reg_lambda': 0.3303288382494436, 'reg_alpha': 0.8805091086696613, 'min_child_weight': 0.5412198370415773, 'max_delta_step': 1.8812423859276484, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 15:46:01,117] Trial 4 finished with value: 0.3178367951841102 and parameters: {'eval_metric': 'mlogloss', 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.28042094091723957, 'subsample': 0.8748079600748659, 'colsample_bynode': 0.7327334581105734, 'colsample_bylevel': 0.8999482613372792, 'colsample_bytree': 0.7693607940693447, 'reg_lambda': 0.3815903434097426, 'reg_alpha': 0.8623936188913988, 'min_child_weight': 2.7005777565812403, 'max_delta_step': 2.191396628866615, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 1 with value: 0.98652911126521.
[I 2024-10-23 16:15:49,202] A new study created in memory with name: no-name-a0e2dfff-ad1e-430a-83da-8c2658eb7f24
---------------------------------------------------- Running LGBM with - all_features
[I 2024-10-23 16:20:18,471] Trial 0 finished with value: 0.21128037664698396 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 77, 'learning_rate': 0.1376303920077012, 'boosting_type': 'dart', 'min_child_samples': 42, 'min_child_weight': 0.025706201341357374, 'subsample': 0.7212125748952531, 'colsample_bytree': 0.845448028736891, 'reg_lambda': 0.7114604711726275, 'reg_alpha': 0.5264611330673967, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 78}. Best is trial 0 with value: 0.21128037664698396.
[I 2024-10-23 16:24:34,273] Trial 1 finished with value: 0.2229241640471742 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 5, 'num_leaves': 80, 'learning_rate': 0.2204897078016253, 'boosting_type': 'dart', 'min_child_samples': 98, 'min_child_weight': 0.21317550217208076, 'subsample': 0.8213259238637353, 'colsample_bytree': 0.7898302629863433, 'reg_lambda': 0.30281874687342597, 'reg_alpha': 0.7031568672034261, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 90}. Best is trial 1 with value: 0.2229241640471742.
[I 2024-10-23 16:28:04,368] Trial 2 finished with value: 0.18669233216569361 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 37, 'learning_rate': 0.2921819484473355, 'boosting_type': 'dart', 'min_child_samples': 55, 'min_child_weight': 0.036671632089617025, 'subsample': 0.853650786932557, 'colsample_bytree': 0.8595846794229967, 'reg_lambda': 0.5896334785603745, 'reg_alpha': 0.44443686758197776, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 75}. Best is trial 1 with value: 0.2229241640471742.
[I 2024-10-23 16:31:05,025] Trial 3 finished with value: 0.22643134709754595 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 79, 'learning_rate': 0.1477973664871458, 'boosting_type': 'dart', 'min_child_samples': 66, 'min_child_weight': 0.5848943222502431, 'subsample': 0.7578562280665435, 'colsample_bytree': 0.8569013714174842, 'reg_lambda': 0.7830582910962314, 'reg_alpha': 0.47600684644099023, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 69}. Best is trial 3 with value: 0.22643134709754595.
[I 2024-10-23 16:36:38,141] Trial 4 finished with value: 0.18538714391724262 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 35, 'learning_rate': 0.21928653738419931, 'boosting_type': 'dart', 'min_child_samples': 91, 'min_child_weight': 0.6094986845757975, 'subsample': 0.7401054887766454, 'colsample_bytree': 0.8004790468730579, 'reg_lambda': 0.9058436600151478, 'reg_alpha': 0.3303288382494436, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 95}. Best is trial 3 with value: 0.22643134709754595.
[I 2024-10-23 16:37:51,994] A new study created in memory with name: no-name-14a5bfb0-d65d-4048-a3a8-6997bde63d13
Running LGBM with - Boruta_features
[I 2024-10-23 16:42:15,497] Trial 0 finished with value: 0.22158941529393283 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 77, 'learning_rate': 0.1376303920077012, 'boosting_type': 'dart', 'min_child_samples': 42, 'min_child_weight': 0.025706201341357374, 'subsample': 0.7212125748952531, 'colsample_bytree': 0.845448028736891, 'reg_lambda': 0.7114604711726275, 'reg_alpha': 0.5264611330673967, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 78}. Best is trial 0 with value: 0.22158941529393283.
[I 2024-10-23 16:46:33,768] Trial 1 finished with value: 0.2351351626097232 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 5, 'num_leaves': 80, 'learning_rate': 0.2204897078016253, 'boosting_type': 'dart', 'min_child_samples': 98, 'min_child_weight': 0.21317550217208076, 'subsample': 0.8213259238637353, 'colsample_bytree': 0.7898302629863433, 'reg_lambda': 0.30281874687342597, 'reg_alpha': 0.7031568672034261, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 90}. Best is trial 1 with value: 0.2351351626097232.
[I 2024-10-23 16:50:13,568] Trial 2 finished with value: 0.20046011219796397 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 37, 'learning_rate': 0.2921819484473355, 'boosting_type': 'dart', 'min_child_samples': 55, 'min_child_weight': 0.036671632089617025, 'subsample': 0.853650786932557, 'colsample_bytree': 0.8595846794229967, 'reg_lambda': 0.5896334785603745, 'reg_alpha': 0.44443686758197776, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 75}. Best is trial 1 with value: 0.2351351626097232.
[I 2024-10-23 16:53:16,192] Trial 3 finished with value: 0.23679878016092668 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 79, 'learning_rate': 0.1477973664871458, 'boosting_type': 'dart', 'min_child_samples': 66, 'min_child_weight': 0.5848943222502431, 'subsample': 0.7578562280665435, 'colsample_bytree': 0.8569013714174842, 'reg_lambda': 0.7830582910962314, 'reg_alpha': 0.47600684644099023, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 69}. Best is trial 3 with value: 0.23679878016092668.
[I 2024-10-23 16:58:54,234] Trial 4 finished with value: 0.19816187937138333 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 35, 'learning_rate': 0.21928653738419931, 'boosting_type': 'dart', 'min_child_samples': 91, 'min_child_weight': 0.6094986845757975, 'subsample': 0.7401054887766454, 'colsample_bytree': 0.8004790468730579, 'reg_lambda': 0.9058436600151478, 'reg_alpha': 0.3303288382494436, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 95}. Best is trial 3 with value: 0.23679878016092668.
[I 2024-10-23 17:00:07,368] A new study created in memory with name: no-name-f7a3d15b-333d-4efb-88f5-8c0a43d8d060
Running LGBM with - MI_features
[I 2024-10-23 17:04:42,174] Trial 0 finished with value: 0.24095179928752009 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 77, 'learning_rate': 0.1376303920077012, 'boosting_type': 'dart', 'min_child_samples': 42, 'min_child_weight': 0.025706201341357374, 'subsample': 0.7212125748952531, 'colsample_bytree': 0.845448028736891, 'reg_lambda': 0.7114604711726275, 'reg_alpha': 0.5264611330673967, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 78}. Best is trial 0 with value: 0.24095179928752009.
[I 2024-10-23 17:09:11,312] Trial 1 finished with value: 0.2591430714091621 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 5, 'num_leaves': 80, 'learning_rate': 0.2204897078016253, 'boosting_type': 'dart', 'min_child_samples': 98, 'min_child_weight': 0.21317550217208076, 'subsample': 0.8213259238637353, 'colsample_bytree': 0.7898302629863433, 'reg_lambda': 0.30281874687342597, 'reg_alpha': 0.7031568672034261, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 90}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:12:51,626] Trial 2 finished with value: 0.22137292794654048 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 37, 'learning_rate': 0.2921819484473355, 'boosting_type': 'dart', 'min_child_samples': 55, 'min_child_weight': 0.036671632089617025, 'subsample': 0.853650786932557, 'colsample_bytree': 0.8595846794229967, 'reg_lambda': 0.5896334785603745, 'reg_alpha': 0.44443686758197776, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 75}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:16:04,539] Trial 3 finished with value: 0.25893959322359505 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 6, 'num_leaves': 79, 'learning_rate': 0.1477973664871458, 'boosting_type': 'dart', 'min_child_samples': 66, 'min_child_weight': 0.5848943222502431, 'subsample': 0.7578562280665435, 'colsample_bytree': 0.8569013714174842, 'reg_lambda': 0.7830582910962314, 'reg_alpha': 0.47600684644099023, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 69}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:21:44,786] Trial 4 finished with value: 0.22016438352603665 and parameters: {'objective': 'multiclass', 'eval_metric': 'multi_logloss', 'num_class': 4, 'max_depth': 7, 'num_leaves': 35, 'learning_rate': 0.21928653738419931, 'boosting_type': 'dart', 'min_child_samples': 91, 'min_child_weight': 0.6094986845757975, 'subsample': 0.7401054887766454, 'colsample_bytree': 0.8004790468730579, 'reg_lambda': 0.9058436600151478, 'reg_alpha': 0.3303288382494436, 'seed': 2024, 'n_jobs': 1, 'num_boost_round': 95}. Best is trial 1 with value: 0.2591430714091621.
[I 2024-10-23 17:23:32,857] A new study created in memory with name: no-name-dd58db87-07d0-4590-8bda-cafa57b5bf1f
---------------------------------------------------- Running CAT with - all_features
[I 2024-10-23 17:25:28,195] Trial 0 finished with value: 0.3748505945661208 and parameters: {'iterations': 84, 'depth': 8, 'learning_rate': 0.12296210782212309, 'l2_leaf_reg': 0.5337047810939617, 'bootstrap_type': 'Bayesian', 'random_strength': 7.299677422476102, 'border_count': 87, 'random_seed': 2024, 'bagging_temperature': 0.4738457034082185}. Best is trial 0 with value: 0.3748505945661208.
[I 2024-10-23 17:26:05,876] Trial 1 finished with value: 0.3626357377698067 and parameters: {'iterations': 78, 'depth': 5, 'learning_rate': 0.2286023354617474, 'l2_leaf_reg': 6.064240536180453, 'bootstrap_type': 'Bayesian', 'random_strength': 6.105633231254895, 'border_count': 58, 'random_seed': 2024, 'bagging_temperature': 0.22535416319269552}. Best is trial 0 with value: 0.3748505945661208.
[I 2024-10-23 17:28:04,527] Trial 2 finished with value: 0.33623335450939446 and parameters: {'iterations': 87, 'depth': 8, 'learning_rate': 0.1327685470388916, 'l2_leaf_reg': 1.045867323217618, 'bootstrap_type': 'Bayesian', 'random_strength': 2.893434682492068, 'border_count': 98, 'random_seed': 2024, 'bagging_temperature': 0.7979233971149834}. Best is trial 0 with value: 0.3748505945661208.
[I 2024-10-23 17:28:48,849] Trial 3 finished with value: 0.3759487722415203 and parameters: {'iterations': 82, 'depth': 6, 'learning_rate': 0.152087590757984, 'l2_leaf_reg': 2.929691145924111, 'bootstrap_type': 'Bayesian', 'random_strength': 4.433444876033651, 'border_count': 113, 'random_seed': 2024, 'bagging_temperature': 0.2892811403327177}. Best is trial 3 with value: 0.3759487722415203.
[I 2024-10-23 17:30:58,296] Trial 4 finished with value: 0.3276478659520506 and parameters: {'iterations': 92, 'depth': 8, 'learning_rate': 0.15824656330287373, 'l2_leaf_reg': 2.335110799201222, 'bootstrap_type': 'Bayesian', 'random_strength': 6.0046836005178665, 'border_count': 107, 'random_seed': 2024, 'bagging_temperature': 0.8924863863290551}. Best is trial 3 with value: 0.3759487722415203.
CatBoost training log (condensed): 82 iterations, learn loss 1.1325150 → 0.2512924, total 20.4s
[I 2024-10-23 17:31:20,269] A new study created in memory with name: no-name-bd296dc2-4441-4249-a319-9c5e0bbc9120
Running CAT with - Boruta_features
[I 2024-10-23 17:32:49,271] Trial 0 finished with value: 0.38440707201648483 and parameters: {'iterations': 84, 'depth': 8, 'learning_rate': 0.12296210782212309, 'l2_leaf_reg': 0.5337047810939617, 'bootstrap_type': 'Bayesian', 'random_strength': 7.299677422476102, 'border_count': 87, 'random_seed': 2024, 'bagging_temperature': 0.4738457034082185}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:33:23,262] Trial 1 finished with value: 0.3678413147143959 and parameters: {'iterations': 78, 'depth': 5, 'learning_rate': 0.2286023354617474, 'l2_leaf_reg': 6.064240536180453, 'bootstrap_type': 'Bayesian', 'random_strength': 6.105633231254895, 'border_count': 58, 'random_seed': 2024, 'bagging_temperature': 0.22535416319269552}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:34:57,372] Trial 2 finished with value: 0.34560453284100395 and parameters: {'iterations': 87, 'depth': 8, 'learning_rate': 0.1327685470388916, 'l2_leaf_reg': 1.045867323217618, 'bootstrap_type': 'Bayesian', 'random_strength': 2.893434682492068, 'border_count': 98, 'random_seed': 2024, 'bagging_temperature': 0.7979233971149834}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:35:36,595] Trial 3 finished with value: 0.3812228767182266 and parameters: {'iterations': 82, 'depth': 6, 'learning_rate': 0.152087590757984, 'l2_leaf_reg': 2.929691145924111, 'bootstrap_type': 'Bayesian', 'random_strength': 4.433444876033651, 'border_count': 113, 'random_seed': 2024, 'bagging_temperature': 0.2892811403327177}. Best is trial 0 with value: 0.38440707201648483.
[I 2024-10-23 17:37:57,313] Trial 4 finished with value: 0.3361079064881016 and parameters: {'iterations': 92, 'depth': 8, 'learning_rate': 0.15824656330287373, 'l2_leaf_reg': 2.335110799201222, 'bootstrap_type': 'Bayesian', 'random_strength': 6.0046836005178665, 'border_count': 107, 'random_seed': 2024, 'bagging_temperature': 0.8924863863290551}. Best is trial 0 with value: 0.38440707201648483.
CatBoost training log (condensed): 84 iterations, learn loss 1.1677280 → 0.2536267, total 56.6s
[I 2024-10-23 17:38:55,136] A new study created in memory with name: no-name-acf218df-fda9-43c9-9df2-3053e8ce69b7
Running CAT with - MI_features
[I 2024-10-23 17:40:48,543] Trial 0 finished with value: 0.40825074006423584 and parameters: {'iterations': 84, 'depth': 8, 'learning_rate': 0.12296210782212309, 'l2_leaf_reg': 0.5337047810939617, 'bootstrap_type': 'Bayesian', 'random_strength': 7.299677422476102, 'border_count': 87, 'random_seed': 2024, 'bagging_temperature': 0.4738457034082185}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:41:32,099] Trial 1 finished with value: 0.39577163823975137 and parameters: {'iterations': 78, 'depth': 5, 'learning_rate': 0.2286023354617474, 'l2_leaf_reg': 6.064240536180453, 'bootstrap_type': 'Bayesian', 'random_strength': 6.105633231254895, 'border_count': 58, 'random_seed': 2024, 'bagging_temperature': 0.22535416319269552}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:43:22,633] Trial 2 finished with value: 0.36876867748199216 and parameters: {'iterations': 87, 'depth': 8, 'learning_rate': 0.1327685470388916, 'l2_leaf_reg': 1.045867323217618, 'bootstrap_type': 'Bayesian', 'random_strength': 2.893434682492068, 'border_count': 98, 'random_seed': 2024, 'bagging_temperature': 0.7979233971149834}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:44:05,584] Trial 3 finished with value: 0.40728893721284903 and parameters: {'iterations': 82, 'depth': 6, 'learning_rate': 0.152087590757984, 'l2_leaf_reg': 2.929691145924111, 'bootstrap_type': 'Bayesian', 'random_strength': 4.433444876033651, 'border_count': 113, 'random_seed': 2024, 'bagging_temperature': 0.2892811403327177}. Best is trial 0 with value: 0.40825074006423584.
[I 2024-10-23 17:46:10,647] Trial 4 finished with value: 0.3612604859095497 and parameters: {'iterations': 92, 'depth': 8, 'learning_rate': 0.15824656330287373, 'l2_leaf_reg': 2.335110799201222, 'bootstrap_type': 'Bayesian', 'random_strength': 6.0046836005178665, 'border_count': 107, 'random_seed': 2024, 'bagging_temperature': 0.8924863863290551}. Best is trial 0 with value: 0.40825074006423584.
CatBoost training log (condensed): 84 iterations, learn loss 1.1730609 → 0.2771441, total 43.6s
----------------------------------------------------
CPU times: total: 1d 5h 8min 44s Wall time: 8h 15min 7s
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'TimeElapsed', 'Optuna_TimeElapsed']
for metric in metrics:
    data = []
    for technique in feature_set_result:
        for model in feature_set_result[technique]:
            data.append({
                'Feature_Set': technique,
                'Model': model,
                'Score': feature_set_result[technique][model][metric]
            })
    metric_df = pd.DataFrame(data)
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Feature_Set', y='Score', hue='Model', data=metric_df, palette='Set2')
    if metric == 'TimeElapsed':
        plt.title('Comparison of Time Elapsed Across Models and Feature Sets')
        plt.ylabel('Time Elapsed (seconds)')
    elif metric == 'Optuna_TimeElapsed':
        plt.title('Comparison of Optuna Time Elapsed Across Models and Feature Sets')
        plt.ylabel('Optuna Time Elapsed (seconds)')
    else:
        plt.title(f'Comparison of {metric} Across Models and Feature Sets')
        plt.ylabel(f'{metric} Score')
    plt.xlabel('Feature Selection Technique')
    # annotate each bar with its height
    for p in plt.gca().patches:
        plt.gca().annotate(f'{p.get_height():.2f}',
                           (p.get_x() + p.get_width() / 2., p.get_height()),
                           ha='center', va='baseline', fontsize=12, color='black',
                           xytext=(0, 5), textcoords='offset points')
    plt.legend(title='Model')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
Feature Importances¶
num_models = len(best_models)
num_techniques = len(feature_selection_techniques)
total_plots = num_models * num_techniques
rows = (total_plots // 3) + (total_plots % 3 > 0)  # ceiling division by 3
cols = min(total_plots, 3)

fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(18, 12))
ax = ax.flatten()

plot_idx = 0
for model in best_models:
    for technique in feature_selection_techniques:
        features = feature_selection_techniques[technique]
        importances_score = best_models[model][technique].feature_importances_
        df_feature_importances = pd.Series(index=features, data=importances_score)
        df_feature_importances = df_feature_importances.sort_values(ascending=False)

        sns.barplot(x=df_feature_importances.index, y=df_feature_importances.values, ax=ax[plot_idx])
        ax[plot_idx].set_title(f"Feature Importance - {model} - {technique}")
        ax[plot_idx].set_xlabel('Features')
        ax[plot_idx].set_ylabel('Importance')
        ax[plot_idx].tick_params(axis='x', rotation=90)
        plot_idx += 1

plt.tight_layout()
plt.show()
Summary of Models Evaluation Results¶
XGBoost¶
XGBoost consistently achieved high performance across different feature sets, with accuracy, precision, recall, and F1-score all exceeding 0.9 on the validation data. This indicates its strong ability to generalize and handle the classification task effectively. While XGBoost required more time to train compared to other models, the performance gains make it the best option.
LightGBM¶
LightGBM also performed well, delivering solid results with high accuracy and recall. Its shorter training time makes it a fast alternative, though it did not quite reach the same level of performance as XGBoost.
CatBoost¶
CatBoost had the shortest training time, but its overall performance lagged behind both XGBoost and LightGBM, making it the least favorable model for this task.
Conclusion¶
XGBoost with Boruta features proved to be the most reliable and effective model, surpassing 0.9 in accuracy, precision, recall, and F1-score on the validation data. Given these results, we will retrain XGBoost with Boruta features using Optuna, increasing the number of trials from 5 to 20 to further fine-tune and optimize its performance.
%%time
training_time = {}
trials = 20
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE))
start_time = time.time()
study.optimize(lambda trial: objective(trial, X_train[boruta_selected_features], y_train, xgb.XGBClassifier()), n_trials=trials, n_jobs=1)
end_time = time.time()
training_time['Optuna'] = end_time - start_time
best_params = study.best_trial.params
best_model = xgb.XGBClassifier(**best_params)
sample_weight = compute_sample_weight('balanced', y_train)
start_time = time.time()
best_model.fit(X_train[boruta_selected_features], y_train, sample_weight=sample_weight)
end_time = time.time()
training_time["Fitting model"] = end_time - start_time
y_pred = best_model.predict(X_test[boruta_selected_features])
[I 2024-10-24 17:17:04,056] A new study created in memory with name: no-name-c66d1984-342c-4086-9688-0f70ad9277bc
[I 2024-10-24 17:31:03,085] Trial 0 finished with value: 0.38226456908590534. Best is trial 0 with value: 0.38226456908590534.
...
[I 2024-10-24 21:33:28,591] Trial 11 finished with value: 0.9912221691037868 and parameters: {'eval_metric': 'auc', 'n_estimators': 98, 'max_depth': 9, 'learning_rate': 0.29487506519706436, 'subsample': 0.7810327727954102, 'colsample_bynode': 0.898120876774128, 'colsample_bylevel': 0.7176875810430865, 'colsample_bytree': 0.7630714493366861, 'reg_lambda': 0.7053144541694071, 'reg_alpha': 0.5136639848280791, 'min_child_weight': 1.915462286347568, 'max_delta_step': 0.5509085497851001, 'booster': 'dart', 'objective': 'multi:softmax', 'num_class': 4, 'random_state': 2024, 'nthread': 4, 'n_jobs': 1}. Best is trial 11 with value: 0.9912221691037868.
...
[I 2024-10-25 02:02:33,828] Trial 19 finished with value: 0.9910225704386083. Best is trial 11 with value: 0.9912221691037868.
[Optuna log truncated: 20 trials in total. Trials sampled with eval_metric='mlogloss' scored 0.25-0.38, while trials with eval_metric='auc' scored 0.986-0.991; trial 11 remained the best.]
CPU times: total: 1d 13h 6min 48s
Wall time: 11h 35s
check_accuracy(y_test, y_pred, "XGB-model with boruta_selected_features and 20 n_trials")
sns.barplot(training_time, palette='Set2')
plt.title("Training Time in seconds")
plt.show()
Model Comparison with Optuna Hyperparameter Tuning¶
After rerunning XGBoost with Boruta-selected features and Optuna hyperparameter optimization over 20 trials, the model produced excellent results:
- Accuracy: 0.95
- Precision: 0.92
- Recall: 0.95
- F1-Score: 0.93
Despite increasing the number of trials from 5 to 20, the performance improvement was marginal, while the extended tuning run took roughly 11 hours.
Confusion Matrix Insights¶
Analysis of the confusion matrix reveals that most misclassifications occur within the "phishing" class, particularly:
- Phishing URLs are sometimes misclassified as "benign"
- Phishing URLs are occasionally mislabeled as "malware"
These patterns suggest that while the model generally performs well, additional feature engineering or targeted optimization may be beneficial to improve its ability to distinguish phishing URLs from other classes, specifically benign URLs and malware.
Given the long tuning time, LightGBM remains an attractive alternative when faster tuning is needed without major performance compromises, while phishing URL detection stands out as the main opportunity for further improvement.
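To make the confusion matrix reading above concrete, here is a hedged sketch of how such per-class error counts are read off a confusion matrix. The labels are toy values, not the project's actual data, and the 0-3 class encoding shown is a hypothetical one for illustration:

```python
import numpy as np

def toy_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical encoding: 0=benign, 1=phishing, 2=defacement, 3=malware
y_true = [1, 1, 1, 1, 0, 2, 3, 1]
y_pred = [1, 0, 3, 1, 0, 2, 3, 1]  # two phishing URLs misclassified

cm = toy_confusion_matrix(y_true, y_pred, n_classes=4)
phishing_as_benign = cm[1, 0]   # phishing predicted as benign
phishing_as_malware = cm[1, 3]  # phishing predicted as malware
print(phishing_as_benign, phishing_as_malware)
```

Off-diagonal entries in the phishing row are exactly the misclassifications discussed above.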
Interpretability with SHAP (SHapley Additive exPlanations)¶
What is SHAP?¶
SHAP is a unified framework for interpreting the output of machine learning models. It provides insight into how each feature in the model contributes to the final prediction, using the concept of Shapley values from cooperative game theory.
Shapley Values:¶
In game theory, Shapley values represent the fair distribution of a reward among players based on their contributions. In the context of machine learning, the “players” are the features of the model, and SHAP assigns each feature a value that reflects its contribution to the model’s prediction.
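For reference, the Shapley value of feature $i$ over feature set $N$ can be written in its standard game-theoretic form (the general formula, not an implementation detail of this notebook), where $f(S)$ denotes the model's output when only the features in coalition $S$ are present:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}\,\bigl[f(S \cup \{i\}) - f(S)\bigr]$$

Intuitively, $\phi_i$ averages feature $i$'s marginal contribution over all possible orderings in which features could be added to the model.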
Why Use SHAP?¶
Explainability: SHAP provides clear explanations of how features contribute to predictions, both globally (across all predictions) and locally (for individual predictions).
Flexibility: SHAP works with various machine learning models (tree-based models, neural networks, etc.) and is model-agnostic, providing a unified framework for understanding feature importance.
Feature Engineering:
- SHAP can help uncover which features are truly important and which features might be redundant or irrelevant.
- By identifying feature interactions and analyzing the importance of different feature combinations, SHAP can guide the creation of new, meaningful features or the removal of noisy ones.
- It highlights features that have non-linear impacts on predictions, revealing areas where additional transformations or domain-specific knowledge could improve model performance.
Trust and Compliance: By understanding why a model makes certain predictions, users can trust the model more. For regulated industries like healthcare and finance, SHAP provides explanations to meet transparency requirements.
Debugging: SHAP can help identify when a model is relying on irrelevant or incorrect features, which can improve model debugging and refinement.
Accuracy: SHAP’s foundation on Shapley values provides accurate and mathematically sound explanations, ensuring that the feature attributions are consistent and unbiased across models.
Why is Model Interpretability Important?¶
- Trust: By understanding why a model makes certain predictions, users can gain more trust in its decisions.
- Debugging: SHAP helps identify issues, such as when a model relies on irrelevant features or biases.
- Compliance: In industries where transparency is crucial (healthcare, finance, cyber security), SHAP provides insights to meet regulatory requirements.
- Model Insights: Understanding feature importance and interactions can provide valuable insights for improving models or business processes.
SHAP in Action¶
Key Benefits:¶
- Model-Agnostic: SHAP works with any machine learning model (tree-based models, neural networks).
- Visualizations: SHAP offers powerful visualization tools like force plots, summary plots, and dependence plots, making it easier to interpret the impact of features.
- Feature Interactions: SHAP can also explain how features interact with each other to influence predictions.
Here we apply SHAP to the XGBoost model trained on Boruta-selected features after 20 Optuna trials.
%%time
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test[boruta_selected_features])
CPU times: total: 1h 33min 19s Wall time: 23min 53s
for k, target_label in enumerate(encoding_map.keys()):
    print(f"SHAP Summary for '{target_label.capitalize()}'")
    print("--------------------------------------------------------------------------------------")
    shap_values_class = shap_values[:, :, k]
    shap.summary_plot(shap_values_class, X_test[boruta_selected_features], plot_size=(8, 8))
    plt.show()
    print("\n\n")
SHAP Summary for 'Benign' --------------------------------------------------------------------------------------
SHAP Summary for 'Phishing' --------------------------------------------------------------------------------------
SHAP Summary for 'Defacement' --------------------------------------------------------------------------------------
SHAP Summary for 'Malware' --------------------------------------------------------------------------------------
shap_values_mean = np.mean(np.abs(shap_values), axis=2)
print("Aggregated SHAP Summary for All Classes")
print(f"--------------------------------------------------------------------------------------")
shap.summary_plot(shap_values_mean, X_test[boruta_selected_features], plot_size=(10, 8), plot_type='bar')
Aggregated SHAP Summary for All Classes --------------------------------------------------------------------------------------
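As a sanity check on the aggregation above (toy numbers, not actual SHAP output): with a multi-class SHAP array of shape (samples, features, classes), taking the mean absolute value over the class axis leaves one importance magnitude per sample-feature pair:

```python
import numpy as np

# Toy "SHAP values": 2 samples, 3 features, 4 classes
toy = np.array([
    [[ 1.0, -1.0,  2.0, 0.0],
     [ 0.5,  0.5, -0.5, 0.5],
     [ 0.0,  0.0,  0.0, 0.0]],
    [[-2.0,  2.0,  2.0, 2.0],
     [ 1.0,  1.0,  1.0, 1.0],
     [ 0.0,  4.0,  0.0, 0.0]],
])

agg = np.mean(np.abs(toy), axis=2)  # shape (2, 3): per-sample, per-feature magnitude
print(agg)
```

Averaging these magnitudes over samples is what the aggregated bar plot above visualizes.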
Conclusion on SHAP Analysis¶
Through the SHAP Summary and Aggregated SHAP Summary, we observe that the SHAP values align closely with the model's feature importance rankings. This alignment indicates that all the selected features meaningfully contribute to the model's performance.
In the SHAP Summary, it becomes clear that each class has its own set of important features that assist with classification. For example, features like is_abnormal_url, count_http, and count_// have a significant influence across all classes. Meanwhile, certain features, such as is_abnormal_url, play a more prominent role specifically in identifying "Benign" and "Defacement" cases, while count_subdomain is more important for identifying "Phishing" cases.
This analysis provides deeper insights into how individual features drive model decisions and helps validate our feature selection process.
Deep Learning Models¶
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().set_output(transform="pandas")
ds = X_train.describe().T
# Heuristic: columns whose max is 1, min is 0, and median is 0 are treated as binary flags
binary_cols = ds[(ds['max'] == 1) & (ds['min'] == 0) & (ds['50%'] == 0)].index
X_train_binary = X_train[binary_cols]
X_test_binary = X_test[binary_cols]
X_train_numerical = X_train.drop(binary_cols, axis=1)
X_test_numerical = X_test.drop(binary_cols, axis=1)
X_train_numerical_scaled = scaler.fit_transform(X_train_numerical)
X_test_numerical_scaled = scaler.transform(X_test_numerical)
X_train_scaled = pd.merge(X_train_numerical_scaled, X_train_binary, left_index=True, right_index=True)
X_test_scaled = pd.merge(X_test_numerical_scaled, X_test_binary, left_index=True, right_index=True)
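A quick illustration of the binary-column heuristic used above, on a toy DataFrame with made-up column names: a column is treated as a binary flag when its max is 1, min is 0, and median is 0. Note that a binary column whose median happens to be 1 (mostly ones) would slip past this filter and be min-max scaled instead, which for a column already spanning 0-1 is harmless:

```python
import pandas as pd

toy = pd.DataFrame({
    "is_https": [0, 1, 0, 0],      # binary flag, median 0 -> selected
    "url_len":  [10, 25, 13, 40],  # numerical -> left for scaling
})
ds = toy.describe().T
binary_cols = ds[(ds["max"] == 1) & (ds["min"] == 0) & (ds["50%"] == 0)].index
print(list(binary_cols))
```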
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader,TensorDataset, WeightedRandomSampler
from typing import Callable, List, Tuple

def create_fnn_model(input_size: int, output_size: int, layer_units: List[int], activation_functions: List[str],
                     lr: float, gamma: float, step_size: int, L2lambda: float, dropout_rate: float,
                     doBN: bool) -> Tuple[nn.Module, nn.Module, optim.Optimizer, optim.lr_scheduler.StepLR]:
    class FnnModel(nn.Module):
        def __init__(self, input_size, output_size, layer_units, activation_functions, dropout_rate, doBN):
            super().__init__()
            if len(activation_functions) != len(layer_units):
                raise ValueError(f"The number of activation functions must match the number of layers: "
                                 f"len of activation_functions: {len(activation_functions)}, "
                                 f"len of layer_units: {len(layer_units)}")
            self.layers = nn.ModuleDict()
            self.input_size = input_size
            self.output_size = output_size
            self.nLayers = len(layer_units)
            self.layer_units = layer_units
            self.dropout_rate = dropout_rate
            self.doBN = doBN
            self.activation_functions = activation_functions

            self.layers['input'] = nn.Linear(self.input_size, self.layer_units[0])
            if self.doBN:
                self.layers['bn_input'] = nn.BatchNorm1d(self.layer_units[0])
            for layer in range(1, self.nLayers):
                self.layers[f'hidden_{layer}'] = nn.Linear(self.layer_units[layer - 1], self.layer_units[layer])
                if self.doBN:
                    self.layers[f'bn_{layer}'] = nn.BatchNorm1d(self.layer_units[layer])
                if dropout_rate > 0:
                    self.layers[f'dropout_{layer}'] = nn.Dropout(dropout_rate)
            self.layers['output'] = nn.Linear(self.layer_units[-1], self.output_size)

        def forward(self, x):
            actfun = getattr(F, self.activation_functions[0], None)
            if actfun is None:
                raise ValueError(f"Activation function '{self.activation_functions[0]}' is not defined.")
            x = self.layers['input'](x)
            if self.doBN:
                x = self.layers['bn_input'](x)
            x = actfun(x)
            for fc in range(1, self.nLayers):
                x = self.layers[f'hidden_{fc}'](x)
                if self.doBN:
                    x = self.layers[f'bn_{fc}'](x)
                actfun = getattr(F, self.activation_functions[fc], None)
                if actfun is None:
                    raise ValueError(f"Activation function '{self.activation_functions[fc]}' is not defined.")
                x = actfun(x)
                if self.dropout_rate > 0:
                    x = self.layers[f'dropout_{fc}'](x)
            x = self.layers['output'](x)
            return x

    net = FnnModel(input_size, output_size, layer_units, activation_functions, dropout_rate, doBN)
    lossfun = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=lr, weight_decay=L2lambda)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
    return net, lossfun, optimizer, scheduler
def function2trainTheModel(numepochs: int, train_loader: DataLoader, test_loader: DataLoader, net: nn.Module,
                           lossfun: nn.Module, optimizer: optim.Optimizer, scheduler: optim.lr_scheduler._LRScheduler,
                           computation_metric: Callable, verbos: bool = True
                           ) -> Tuple[nn.Module, torch.Tensor, List[float], List[float], torch.Tensor]:
    start_time = time.time()
    losses = torch.zeros(numepochs)
    trainAcc = []
    testAcc = []

    for epochi in range(numepochs):
        epoch_start_time = time.time()
        net.train()
        batchAcc = []
        batchLoss = []
        for X, y in train_loader:
            yHat = net(X)
            loss = lossfun(yHat, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            batchLoss.append(loss.item())
            batchAcc.append(computation_metric(yHat, y))
        scheduler.step()

        if epochi % scheduler.step_size == 0 and verbos:
            print(f"\n*** Epoch {epochi+1}, Step Size: {epochi}, Learning Rate: {scheduler.get_last_lr()[0]} ***\n")

        trainAcc.append(np.mean(batchAcc))
        losses[epochi] = np.mean(batchLoss)

        epoch_end_time = time.time()
        epoch_duration = epoch_end_time - epoch_start_time
        if epochi == 1:
            # Estimate total training time from the second epoch (the first may include warm-up overhead)
            estimated_total_time = epoch_duration * numepochs
            print(f"\n*** Estimated total time for training: {estimated_total_time:.2f} seconds, {estimated_total_time/60:.2f} minutes. ***\n")
        if verbos:
            print(f'Epoch {epochi+1}/{numepochs}, Loss: {losses[epochi]:.4f}, elapsed time: {epoch_duration:.2f} sec')

        # Evaluate on the (single-batch) test loader
        net.eval()
        X, y = next(iter(test_loader))
        with torch.no_grad():
            yHat = net(X)
        testAcc.append(computation_metric(yHat, y))

    total_time = time.time() - start_time
    print(f"Total time elapsed: {total_time:.2f} sec, {total_time/60:.2f} minutes")
    return net, losses, trainAcc, testAcc, yHat
def plot_training_metrics(losses, trainAcc, testAcc, metricName) -> None:
    fig, ax = plt.subplots(2, 1, figsize=(12, 8))

    ax[0].plot(losses, label='Losses')
    ax[0].set_title("Losses")
    ax[0].set_xlabel("Number of epochs")
    ax[0].set_ylabel("Loss")
    ax[0].legend([f'Loss: {losses[-1]:.3f}'])

    ax[1].plot(trainAcc, label='Train')
    ax[1].plot(testAcc, label='Test')
    ax[1].set_title(f"{metricName}")
    ax[1].set_xlabel("Number of epochs")
    ax[1].set_ylabel(f"{metricName}")
    ax[1].legend([f'Train: {trainAcc[-1]:.3f}', f'Test: {testAcc[-1]:.3f}'])

    plt.tight_layout()
    plt.show()
Initial Network CONFIG¶
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
numepochs = 30
step_size = 10
gamma = 0.8
dropout_rate = 0.2
layer_units = [ 128, 256, 512, 256, 128 ]
learningRate = 0.01
L2lambda = 0.0
compute_accuracy_multi = lambda yHat, y: (torch.argmax(torch.softmax(yHat, dim=1), dim=1) == y).float().sum().item() / len(y) * 100
Balancing Data with WeightedRandomSampler in PyTorch¶
To handle class imbalance, we can use WeightedRandomSampler in PyTorch. This sampler assigns weights to each sample, ensuring that minority classes are represented more frequently during training.
- Calculate Sample Weights: Compute weights for each sample using `compute_sample_weight` from `sklearn`.
- Create a `WeightedRandomSampler`: Pass the sample weights to `WeightedRandomSampler`.
- Use in DataLoader: Add the sampler to `DataLoader` to balance class representation.
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
sample_weights_tensor = torch.tensor(sample_weights, dtype=torch.float32)
sampler = WeightedRandomSampler(weights=sample_weights_tensor, num_samples=len(sample_weights_tensor), replacement=True)
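The 'balanced' weights used above follow sklearn's convention, weight = n_samples / (n_classes * count(class)), so each class ends up with equal total sampling mass. A minimal NumPy sketch with toy labels (made-up counts, not the project's data):

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced toy labels: 6 vs 2
classes, counts = np.unique(y, return_counts=True)
class_weight = len(y) / (len(classes) * counts)       # [8/(2*6), 8/(2*2)]
sample_weights = class_weight[np.searchsorted(classes, y)]

# Each class now carries the same total mass: 6 * (2/3) == 2 * 2.0 == 4.0
print(sample_weights)
```

With these weights, the sampler draws minority-class samples more often, evening out the class frequencies seen per epoch.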
Manual Hyperparameter Search (Based on Experiments)¶
WorkFlow¶
- Network Structure: Batch Normalization (`True` / `False`) and Activation Functions (`'relu6'` or `'tanh'`)
- Network Learning: Batch Size (`1024` / `2048` / `4096`), Learning Rate (`0.001` / `0.01`), and Gamma (`1.0` / `0.7`)
- Network Regularization: Dropout Rate (`0.0` / `0.2` / `0.4`)
- Network Architecture: Breadth net vs. Depth net
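To make the breadth-vs-depth comparison concrete, here is a quick parameter count for two hypothetical fully-connected architectures (made-up layer sizes, not the exact configurations tuned below). Each `Linear` layer contributes (in_units + 1) * out_units parameters, including the bias:

```python
def fc_param_count(input_size, layer_units, output_size):
    """Total weights + biases for a plain stack of Linear layers."""
    sizes = [input_size] + list(layer_units) + [output_size]
    return sum((sizes[i] + 1) * sizes[i + 1] for i in range(len(sizes) - 1))

breadth_net = [512, 512]             # few, wide layers
depth_net = [64, 64, 64, 64, 64]     # many, narrow layers

print(fc_param_count(30, breadth_net, 4))  # hypothetical 30 inputs, 4 classes
print(fc_param_count(30, depth_net, 4))
```

Width dominates the parameter budget: the wide net here has roughly 15x the parameters of the deep one, which is worth keeping in mind when comparing their training times.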
Network Structure¶
%%time
X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)
train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset = TensorDataset(X_test_scaled_tensor, y_test_tensor)
batch_size = 2048
train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader = DataLoader(test_dataDataset, batch_size=test_dataDataset.tensors[0].shape[0], shuffle=False)
input_size = train_loader.dataset.tensors[0].shape[1]
output_size = len(y_train.unique())
relu6 = ['relu6', 'relu6', 'relu6', 'relu6', 'relu6']
tanh = ['tanh', 'tanh', 'tanh', 'tanh', 'tanh']
results_dict = {
"relu6_BatchNorm_True": {"Test Accuracy": None, "Time": None},
"relu6_BatchNorm_False": {"Test Accuracy": None, "Time": None},
"tanh_BatchNorm_True": {"Test Accuracy": None, "Time": None},
"tanh_BatchNorm_False": {"Test Accuracy": None, "Time": None},
}
for functions, activation_functions in zip(["relu6", "tanh"], [relu6, tanh]):
for norm in [True, False]:
config = f"{functions}_BatchNorm_{norm}"
print(f"Activation Functions: {functions} | BatchNorm: {norm}")
start_time = time.time()
net, lossfun, optimizer, scheduler = create_fnn_model(
input_size, output_size, layer_units, activation_functions,
learningRate, gamma, step_size, L2lambda, dropout_rate, norm
)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
numepochs, train_loader, test_loader, net, lossfun, optimizer, scheduler,
compute_accuracy_multi, verbos=0
)
results_dict[config]["Test Accuracy"] = testAcc[-1]
results_dict[config]["Time"] = time.time() - start_time
print(f"Completed for Activation Functions: {functions} | BatchNorm: {norm}\n")
configs = list(results_dict.keys())
test_accuracies = [metrics["Test Accuracy"] for metrics in results_dict.values()]
times = [metrics["Time"] for metrics in results_dict.values()]
fig, ax = plt.subplots(2, 1, figsize=(14, 7))
sns.barplot(x=configs, y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy vs Activation Functions and BatchNorm')
ax[0].set_xlabel('Configurations (Activation Function + BatchNorm)')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)
for p in ax[0].patches:
ax[0].annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
sns.barplot(x=configs, y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Time Taken vs Activation Functions and BatchNorm')
ax[1].set_xlabel('Configurations (Activation Function + BatchNorm)')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')
for p in ax[1].patches:
ax[1].annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
plt.tight_layout()
plt.show()
Activation Functions: relu6 | BatchNorm: True *** Estimated total time for training: 678.90 seconds, 11.32 minutes. *** Total time elapsed: 774.52 sec, 12.91 minutes Completed for Activation Functions: relu6 | BatchNorm: True Activation Functions: relu6 | BatchNorm: False *** Estimated total time for training: 595.95 seconds, 9.93 minutes. *** Total time elapsed: 737.40 sec, 12.29 minutes Completed for Activation Functions: relu6 | BatchNorm: False Activation Functions: tanh | BatchNorm: True *** Estimated total time for training: 667.93 seconds, 11.13 minutes. *** Total time elapsed: 767.24 sec, 12.79 minutes Completed for Activation Functions: tanh | BatchNorm: True Activation Functions: tanh | BatchNorm: False *** Estimated total time for training: 586.91 seconds, 9.78 minutes. *** Total time elapsed: 704.93 sec, 11.75 minutes Completed for Activation Functions: tanh | BatchNorm: False
CPU times: total: 3h 57min 28s Wall time: 49min 45s
Experiment Summary of Results from Activation Functions and Batch Normalization¶
The results show the following key findings:
With Batch Normalization (doBN=True), both activation functions outperformed their counterparts without it:

- relu6_BatchNorm_True: 94.24% accuracy in 774.52 seconds.
- tanh_BatchNorm_True: 93.95% accuracy in 767.24 seconds.

Without Batch Normalization (doBN=False):

- relu6_BatchNorm_False: 93.66% accuracy in 737.40 seconds.
- tanh_BatchNorm_False: 92.34% accuracy in 704.93 seconds.
Conclusion:
- Batch normalization shows a slight improvement over not using batch normalization.
- The difference between ReLU6 and Tanh is relatively small, with no significant advantage of one over the other.
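The transform nn.BatchNorm1d applies in training mode is simple to state: each feature in the batch is shifted to zero mean and scaled to unit variance, then rescaled by learnable parameters. A numpy sketch (with the learnable gamma/beta fixed at 1 and 0):

```python
import numpy as np

def batch_norm_1d(x, eps=1e-5, gamma=1.0, beta=0.0):
    # Training-mode BatchNorm over a (batch, features) array.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Two features on wildly different scales end up comparable,
# which is one reason BatchNorm stabilizes training.
x = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
out = batch_norm_1d(x)
print(out.mean(axis=0), out.std(axis=0))  # ~[0 0] and ~[1 1]
```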
Network Learning¶
%%time
doBN = True
activation_functions = ['relu6']*len(layer_units)
batch_sizes = [1024, 2048, 4096]
learning_rates = [0.001, 0.01]
gammas = [1.0, .7]
step_size = numepochs // 2
results = {}
X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)
train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset = TensorDataset(X_test_scaled_tensor, y_test_tensor)
input_size = X_train_scaled_tensor.shape[1]
output_size = len(y_train.unique())
for batch_size in batch_sizes:
for lr in learning_rates:
for gamma in gammas:
print(f"\nTesting Batch Size: {batch_size}, Learning Rate: {lr}, Gamma: {gamma}")
train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader = DataLoader(test_dataDataset, batch_size=len(test_dataDataset), shuffle=False)
start_time = time.time()
net, lossfun, optimizer, scheduler = create_fnn_model(
input_size=input_size, output_size=output_size,
layer_units=layer_units, activation_functions=activation_functions,
lr=lr, gamma=gamma, step_size=step_size,
L2lambda=0, dropout_rate=0, doBN=doBN
)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
numepochs=numepochs, train_loader=train_loader,
test_loader=test_loader, net=net, lossfun=lossfun,
optimizer=optimizer, scheduler=scheduler,
computation_metric=compute_accuracy_multi, verbos=0
)
elapsed_time = time.time() - start_time
results[(batch_size, lr, gamma)] = {
"Test Accuracy": testAcc[-1],
"Time": elapsed_time
}
for params, metrics in results.items():
print(f"Batch Size: {params[0]}, LR: {params[1]}, Gamma: {params[2]} --> Accuracy: {metrics['Test Accuracy']:.2f}, Time: {metrics['Time']:.2f} seconds")
configs = [f'BS: {params[0]}, LR: {params[1]}, γ: {params[2]}' for params in results.keys()]
test_accuracies = [metrics["Test Accuracy"] for metrics in results.values()]
times = [metrics["Time"] for metrics in results.values()]
fig, ax = plt.subplots(2, 1, figsize=(14, 7))
sns.barplot(x=configs, y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy vs Batch Size, Learning Rate, and Gamma')
ax[0].set_xlabel('Configurations (Batch Size, Learning Rate, Gamma)')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)
for p in ax[0].patches:
ax[0].annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
sns.barplot(x=configs, y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Time Taken vs Batch Size, Learning Rate, and Gamma')
ax[1].set_xlabel('Configurations (Batch Size, Learning Rate, Gamma)')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')
for p in ax[1].patches:
ax[1].annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
plt.tight_layout()
plt.show()
Testing Batch Size: 1024, Learning Rate: 0.001, Gamma: 1.0 *** Estimated total time for training: 1075.80 seconds, 17.93 minutes. *** Total time elapsed: 1250.89 sec, 20.85 minutes Testing Batch Size: 1024, Learning Rate: 0.001, Gamma: 0.7 *** Estimated total time for training: 1072.31 seconds, 17.87 minutes. *** Total time elapsed: 1243.56 sec, 20.73 minutes Testing Batch Size: 1024, Learning Rate: 0.01, Gamma: 1.0 *** Estimated total time for training: 1058.10 seconds, 17.64 minutes. *** Total time elapsed: 1252.66 sec, 20.88 minutes Testing Batch Size: 1024, Learning Rate: 0.01, Gamma: 0.7 *** Estimated total time for training: 1081.19 seconds, 18.02 minutes. *** Total time elapsed: 1263.18 sec, 21.05 minutes Testing Batch Size: 2048, Learning Rate: 0.001, Gamma: 1.0 *** Estimated total time for training: 1027.10 seconds, 17.12 minutes. *** Total time elapsed: 1178.46 sec, 19.64 minutes Testing Batch Size: 2048, Learning Rate: 0.001, Gamma: 0.7 *** Estimated total time for training: 995.41 seconds, 16.59 minutes. *** Total time elapsed: 1175.72 sec, 19.60 minutes Testing Batch Size: 2048, Learning Rate: 0.01, Gamma: 1.0 *** Estimated total time for training: 979.24 seconds, 16.32 minutes. *** Total time elapsed: 1182.46 sec, 19.71 minutes Testing Batch Size: 2048, Learning Rate: 0.01, Gamma: 0.7 *** Estimated total time for training: 1007.06 seconds, 16.78 minutes. *** Total time elapsed: 1186.15 sec, 19.77 minutes Testing Batch Size: 4096, Learning Rate: 0.001, Gamma: 1.0 *** Estimated total time for training: 974.17 seconds, 16.24 minutes. *** Total time elapsed: 1167.93 sec, 19.47 minutes Testing Batch Size: 4096, Learning Rate: 0.001, Gamma: 0.7 *** Estimated total time for training: 1038.73 seconds, 17.31 minutes. *** Total time elapsed: 1174.83 sec, 19.58 minutes Testing Batch Size: 4096, Learning Rate: 0.01, Gamma: 1.0 *** Estimated total time for training: 1031.37 seconds, 17.19 minutes. 
*** Total time elapsed: 1181.84 sec, 19.70 minutes Testing Batch Size: 4096, Learning Rate: 0.01, Gamma: 0.7 *** Estimated total time for training: 1048.11 seconds, 17.47 minutes. *** Total time elapsed: 1180.69 sec, 19.68 minutes Batch Size: 1024, LR: 0.001, Gamma: 1.0 --> Accuracy: 95.19, Time: 1250.91 seconds Batch Size: 1024, LR: 0.001, Gamma: 0.7 --> Accuracy: 95.19, Time: 1243.56 seconds Batch Size: 1024, LR: 0.01, Gamma: 1.0 --> Accuracy: 94.93, Time: 1252.67 seconds Batch Size: 1024, LR: 0.01, Gamma: 0.7 --> Accuracy: 95.36, Time: 1263.18 seconds Batch Size: 2048, LR: 0.001, Gamma: 1.0 --> Accuracy: 94.98, Time: 1178.48 seconds Batch Size: 2048, LR: 0.001, Gamma: 0.7 --> Accuracy: 94.90, Time: 1175.75 seconds Batch Size: 2048, LR: 0.01, Gamma: 1.0 --> Accuracy: 94.66, Time: 1182.47 seconds Batch Size: 2048, LR: 0.01, Gamma: 0.7 --> Accuracy: 95.34, Time: 1186.15 seconds Batch Size: 4096, LR: 0.001, Gamma: 1.0 --> Accuracy: 95.30, Time: 1167.93 seconds Batch Size: 4096, LR: 0.001, Gamma: 0.7 --> Accuracy: 95.34, Time: 1174.83 seconds Batch Size: 4096, LR: 0.01, Gamma: 1.0 --> Accuracy: 94.64, Time: 1181.84 seconds Batch Size: 4096, LR: 0.01, Gamma: 0.7 --> Accuracy: 94.69, Time: 1180.71 seconds
CPU times: total: 19h 42min 25s Wall time: 4h 1min 12s
Summary of Batch Size, Learning Rate, and Gamma Experiments¶
This experiment tested different Batch Sizes (1024, 2048, 4096), Learning Rates (0.001, 0.01), and Gamma values (1.0, 0.7) to assess their impact on model performance.
Key Findings:¶
- Batch Size 1024 consistently achieved the highest accuracy, with up to 95.36% at LR 0.01, Gamma 0.7.
- Gamma 0.7 helped at the higher learning rate (0.01), while at LR 0.001 the two gamma values performed nearly identically.
- Training Time: Smaller batch sizes (1024) took longer but yielded higher accuracy, whereas larger batch sizes (2048, 4096) were faster but slightly less accurate.
Conclusion:¶
The optimal setup was Batch Size 1024, Learning Rate 0.001, and Gamma 1.0 for the best balance of accuracy (95.19%) and efficiency.
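Gamma only matters through the StepLR scheduler: the learning rate is multiplied by gamma every step_size epochs, so gamma=1.0 means a constant rate. The schedule can be reproduced with one line of arithmetic (a sketch of torch's StepLR behavior, not its implementation):

```python
def steplr_schedule(lr0, gamma, step_size, numepochs):
    # lr used at each epoch: multiplied by gamma every step_size epochs.
    return [lr0 * gamma ** (epoch // step_size) for epoch in range(numepochs)]

# With numepochs=50 and step_size=25 (numepochs // 2), gamma=0.7 cuts
# the rate once, mid-training; gamma=1.0 leaves it untouched.
sched = steplr_schedule(0.01, 0.7, 25, 50)
print(sched[24], sched[25])  # the rate drops from 0.01 to 0.007 at epoch 25
```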
Network Regularization¶
%%time
doBN = True
activation_functions = ['relu6']*len(layer_units)
batch_size = 1024
lr = 0.001
gamma = 1.0
dropout_rates = [0.0, 0.2, 0.4]
results = {}
X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)
train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset = TensorDataset(X_test_scaled_tensor, y_test_tensor)
train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader = DataLoader(test_dataDataset, batch_size=len(test_dataDataset), shuffle=False)
input_size = X_train_scaled_tensor.shape[1]
output_size = len(y_train.unique())
for dropout_rate in dropout_rates:
print(f"\nTesting Dropout Rate: {dropout_rate}")
start_time = time.time()
net, lossfun, optimizer, scheduler = create_fnn_model(
input_size=input_size, output_size=output_size,
layer_units=layer_units, activation_functions=activation_functions,
lr=lr, gamma=gamma, step_size=numepochs, L2lambda = 0.0,
dropout_rate=dropout_rate, doBN=doBN
)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
numepochs=numepochs, train_loader=train_loader,
test_loader=test_loader, net=net, lossfun=lossfun,
optimizer=optimizer, scheduler=scheduler,
computation_metric=compute_accuracy_multi, verbos=0
)
elapsed_time = time.time() - start_time
results[dropout_rate] = {
"Test Accuracy": testAcc[-1],
"Time": elapsed_time
}
for dropout_rate, metrics in results.items():
print(f"Dropout Rate: {dropout_rate} --> Accuracy: {metrics['Test Accuracy']:.2f}, Time: {metrics['Time']:.2f} seconds")
dropout_rates = list(results.keys())
test_accuracies = [metrics["Test Accuracy"] for metrics in results.values()]
times = [metrics["Time"] for metrics in results.values()]
fig, ax = plt.subplots(2, 1, figsize=(14, 7))
sns.barplot(x=dropout_rates, y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy vs Dropout Rate')
ax[0].set_xlabel('Dropout Rate')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)
for p in ax[0].patches:
ax[0].annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
sns.barplot(x=dropout_rates, y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Time Taken vs Dropout Rate')
ax[1].set_xlabel('Dropout Rate')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')
for p in ax[1].patches:
ax[1].annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
plt.tight_layout()
plt.show()
Testing Dropout Rate: 0.0 *** Estimated total time for training: 496.70 seconds, 8.28 minutes. *** Total time elapsed: 498.06 sec, 8.30 minutes Testing Dropout Rate: 0.2 *** Estimated total time for training: 421.51 seconds, 7.03 minutes. *** Total time elapsed: 504.22 sec, 8.40 minutes Testing Dropout Rate: 0.4 *** Estimated total time for training: 431.92 seconds, 7.20 minutes. *** Total time elapsed: 498.00 sec, 8.30 minutes Dropout Rate: 0.0 --> Accuracy: 94.88, Time: 498.07 seconds Dropout Rate: 0.2 --> Accuracy: 93.98, Time: 504.22 seconds Dropout Rate: 0.4 --> Accuracy: 93.76, Time: 498.00 seconds
CPU times: total: 2h 6min 29s Wall time: 25min
Summary of Dropout Rate Experiment¶
Experiments were conducted with Dropout Rates of 0.0, 0.2, and 0.4 to observe the effect on accuracy and training time.
Key Findings:¶
- Dropout Rate 0.0 yielded the best accuracy at 94.88% and the fastest training time of 498.07 seconds.
- Higher Dropout Rates (0.2 and 0.4) slightly reduced accuracy, with 93.98% and 93.76%, respectively, and did not significantly impact training time.
Conclusion:¶
A Dropout Rate of 0.0 is optimal, achieving the highest accuracy with no meaningful difference in training time across dropout rates.
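For context on why dropout did not help here: nn.Dropout in training mode zeroes each activation with probability p and rescales survivors by 1/(1-p), so the expected activation is unchanged while the network is discouraged from relying on any single unit. A pure-Python sketch of that inverted-dropout behavior:

```python
import random

def dropout_train(x, p, rng):
    # Inverted dropout: zero with probability p, scale survivors by
    # 1/(1-p) so the expected value is preserved; eval mode is identity.
    return [0.0 if rng.random() < p else v / (1 - p) for v in x]

rng = random.Random(0)
dropped = dropout_train([1.0] * 100_000, p=0.4, rng=rng)
print(sum(dropped) / len(dropped))  # close to 1.0: expectation is preserved
```

With a large training set like this one, the extra noise mostly slows learning, which is consistent with the small accuracy drop observed at p=0.2 and p=0.4.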
Network Architecture¶
batch_size = 1024
lr = 0.001
gamma = 1.0
doBN = True
dropout_rate = 0.0
L2lambda = 0.0
layer_units_wide = [1024, 1024] # Breadth Net: Wider layers
layer_units_deep = [128, 256, 512, 256, 128] # Depth Net: More layers
results = {}
train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader = DataLoader(test_dataDataset,batch_size=test_dataDataset.tensors[0].shape[0])
for network_type, layer_units in [("Deep", layer_units_deep), ("Wide", layer_units_wide)]:
print(f"\nTesting {network_type} Network")
start_time = time.time()
net, lossfun, optimizer, scheduler = create_fnn_model(
input_size=input_size, output_size=output_size,
layer_units=layer_units, activation_functions=['relu6']*len(layer_units),
lr=lr, gamma=gamma, step_size=numepochs,
L2lambda=L2lambda, dropout_rate=dropout_rate, doBN=True
)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(
numepochs=numepochs, train_loader=train_loader,
test_loader=test_loader, net=net, lossfun=lossfun,
optimizer=optimizer, scheduler=scheduler,
computation_metric=compute_accuracy_multi, verbos=0
)
elapsed_time = time.time() - start_time
results[network_type] = {
"Test Accuracy": testAcc[-1],
"Time": elapsed_time
}
for net_type, metrics in results.items():
print(f"{net_type} Network --> Accuracy: {metrics['Test Accuracy']:.2f}, Time: {metrics['Time']:.2f} seconds")
network_types = results.keys()
test_accuracies = [metrics["Test Accuracy"] for metrics in results.values()]
times = [metrics["Time"] for metrics in results.values()]
fig, ax = plt.subplots(2, 1, figsize=(10, 6))
sns.barplot(x=list(network_types), y=test_accuracies, ax=ax[0], palette="Set2")
ax[0].set_title('Test Accuracy: Deep vs. Wide Network')
ax[0].set_xlabel('Network Type')
ax[0].set_ylabel('Test Accuracy')
ax[0].set_ylim(0, 100)
sns.barplot(x=list(network_types), y=times, ax=ax[1], palette="Set2")
ax[1].set_title('Training Time: Deep vs. Wide Network')
ax[1].set_xlabel('Network Type')
ax[1].set_ylabel('Time (seconds)')
ax[1].grid(axis='y')
plt.tight_layout()
plt.show()
Testing Deep Network *** Estimated total time for training: 373.12 seconds, 6.22 minutes. *** Total time elapsed: 453.39 sec, 7.56 minutes Testing Wide Network *** Estimated total time for training: 738.78 seconds, 12.31 minutes. *** Total time elapsed: 806.30 sec, 13.44 minutes Deep Network --> Accuracy: 95.19, Time: 453.39 seconds Wide Network --> Accuracy: 94.69, Time: 806.30 seconds
Experiment Summary of Deep vs. Wide Network Architectures¶
The experiments compared Deep and Wide network architectures to evaluate their impact on model accuracy and training time.
Key Findings:¶
- Deep Network achieved the highest accuracy at 95.19% with a faster training time of 453.39 seconds.
- Wide Network showed slightly lower accuracy at 94.69% and took significantly longer to train at 806.30 seconds.
Conclusion:¶
The Deep Network is the preferred architecture, providing both higher accuracy and faster training compared to a wider network.
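Part of the time gap is simply parameter count. Counting weights and biases for a plain fully connected stack (input_size=20 is a placeholder here; the notebook's actual feature count may differ):

```python
def fc_param_count(input_size, layer_units, output_size):
    # Weights (n_in * n_out) plus biases (n_out) for each Linear layer.
    sizes = [input_size] + layer_units + [output_size]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

deep = fc_param_count(20, [128, 256, 512, 256, 128], 4)
wide = fc_param_count(20, [1024, 1024], 4)
print(deep, wide)  # 332036 vs 1075204: the wide net has ~3x the parameters
```

The 1024x1024 hidden layer alone contributes over a million weights, which accounts for most of the wide net's extra training time.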
Hyperparameter Search Summary¶
Throughout the experiments, various combinations of hyperparameters were tested to optimize model accuracy and efficiency. Across different configurations, the results consistently achieved high accuracy (above 93%), showing the robustness of the model under a variety of settings. Although there were no extreme differences between configurations, the exploration process helped identify the best-performing combination.
This hyperparameter search demonstrated the importance of tuning parameters such as batch size, learning rate, gamma, batch normalization, dropout rate, and network architecture. While the space of possible configurations is effectively unbounded, the staged exploration converged on a preferred combination.
Additional Notes¶
- The experiment was conducted on a CPU, and using a GPU could significantly speed up the process.
- The FNN model performed well across almost all tested configurations.
Conclusion¶
The best configuration was achieved with:
- Batch Size: 1024
- Learning Rate (lr): 0.001
- Gamma: 1.0
- Batch Normalization (doBN): True
- Dropout Rate: 0.0
- Network Architecture: Deep network
This combination provides an effective balance of accuracy and efficiency for the given task.
Updated Network CONFIG¶
numepochs = 50
doBN = True
batch_size = 1024
learningRate = 0.001
gamma = 1.0
step_size = numepochs
dropout_rate = 0.0
L2lambda = 0.0
layer_units = [ 128, 256, 512, 256, 128 ]
activation_functions = ['relu'] * len(layer_units)
X_train_scaled_tensor = torch.tensor(X_train_scaled.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_scaled_tensor = torch.tensor(X_test_scaled.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)
train_dataDataset = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_dataDataset = TensorDataset(X_test_scaled_tensor, y_test_tensor)
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
sample_weights_tensor = torch.tensor(sample_weights, dtype=torch.float32)
sampler = WeightedRandomSampler(weights=sample_weights_tensor, num_samples=len(sample_weights_tensor), replacement=True)
train_loader = DataLoader(train_dataDataset, batch_size=batch_size, drop_last=True, sampler=sampler)
test_loader = DataLoader(test_dataDataset,batch_size=test_dataDataset.tensors[0].shape[0])
input_size = train_loader.dataset.tensors[0].shape[1]
output_size = len(y_train.unique()) # 4
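A small check on what drop_last=True costs per epoch, using the 486,808-row training split reported later for the BERT experiment as an illustrative size (plain arithmetic, no notebook helpers assumed):

```python
def num_batches(n_samples, batch_size, drop_last=True):
    # With drop_last=True the final short batch is discarded each epoch.
    full, rem = divmod(n_samples, batch_size)
    return full if drop_last else full + (rem > 0)

print(num_batches(486_808, 1024))  # 475 full batches per epoch
print(486_808 - 475 * 1024)        # 408 samples left out each epoch
```

Because the WeightedRandomSampler draws with replacement, the leftover remainder is a different random subset each epoch, so no sample is permanently excluded from training.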
Next Steps - Model Training with Best Parameters¶
In this section, we will run the model using the optimal hyperparameters identified earlier. We will conduct two experiments:
- Model Training with All Features
- Model Training with BorutaPy Selected Features
All Features¶
net, lossfun, optimizer, scheduler = create_fnn_model(input_size, output_size, layer_units, activation_functions , learningRate, gamma, step_size, L2lambda, dropout_rate, doBN)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(numepochs, train_loader, test_loader, net, lossfun, optimizer, scheduler, compute_accuracy_multi)
plot_training_metrics(losses, trainAcc, testAcc, 'Accuracy For All Features')
predicted_classes = torch.argmax(yHat, dim=1).cpu().numpy()
check_accuracy(y_test, predicted_classes, 'All Features')
*** Epoch 1, Step Size: 0, Learning Rate: 0.001 *** Epoch 1/50, Loss: 0.2520, elapsed time: 17.55 sec *** Estimated total time for training: 730.51 seconds, 12.18 minutes. *** Epoch 2/50, Loss: 0.1843, elapsed time: 14.61 sec Epoch 3/50, Loss: 0.1646, elapsed time: 14.05 sec Epoch 4/50, Loss: 0.1502, elapsed time: 14.01 sec Epoch 5/50, Loss: 0.1417, elapsed time: 14.14 sec Epoch 6/50, Loss: 0.1343, elapsed time: 13.65 sec Epoch 7/50, Loss: 0.1288, elapsed time: 16.11 sec Epoch 8/50, Loss: 0.1233, elapsed time: 13.84 sec Epoch 9/50, Loss: 0.1193, elapsed time: 13.94 sec Epoch 10/50, Loss: 0.1171, elapsed time: 13.81 sec Epoch 11/50, Loss: 0.1115, elapsed time: 13.63 sec Epoch 12/50, Loss: 0.1104, elapsed time: 13.73 sec Epoch 13/50, Loss: 0.1065, elapsed time: 13.72 sec Epoch 14/50, Loss: 0.1051, elapsed time: 13.78 sec Epoch 15/50, Loss: 0.1027, elapsed time: 12.96 sec Epoch 16/50, Loss: 0.0996, elapsed time: 12.88 sec Epoch 17/50, Loss: 0.0977, elapsed time: 13.87 sec Epoch 18/50, Loss: 0.0957, elapsed time: 13.61 sec Epoch 19/50, Loss: 0.0949, elapsed time: 13.45 sec Epoch 20/50, Loss: 0.0933, elapsed time: 12.83 sec Epoch 21/50, Loss: 0.0910, elapsed time: 15.65 sec Epoch 22/50, Loss: 0.0899, elapsed time: 15.83 sec Epoch 23/50, Loss: 0.0890, elapsed time: 15.25 sec Epoch 24/50, Loss: 0.0877, elapsed time: 15.23 sec Epoch 25/50, Loss: 0.0863, elapsed time: 15.76 sec Epoch 26/50, Loss: 0.0846, elapsed time: 17.00 sec Epoch 27/50, Loss: 0.0846, elapsed time: 14.98 sec Epoch 28/50, Loss: 0.0823, elapsed time: 14.20 sec Epoch 29/50, Loss: 0.0820, elapsed time: 14.89 sec Epoch 30/50, Loss: 0.0806, elapsed time: 20.45 sec Epoch 31/50, Loss: 0.0795, elapsed time: 20.40 sec Epoch 32/50, Loss: 0.0790, elapsed time: 15.66 sec Epoch 33/50, Loss: 0.0776, elapsed time: 15.83 sec Epoch 34/50, Loss: 0.0763, elapsed time: 15.83 sec Epoch 35/50, Loss: 0.0769, elapsed time: 15.20 sec Epoch 36/50, Loss: 0.0755, elapsed time: 20.76 sec Epoch 37/50, Loss: 0.0746, elapsed time: 18.93 
sec Epoch 38/50, Loss: 0.0735, elapsed time: 18.07 sec Epoch 39/50, Loss: 0.0736, elapsed time: 16.63 sec Epoch 40/50, Loss: 0.0730, elapsed time: 14.97 sec Epoch 41/50, Loss: 0.0723, elapsed time: 18.76 sec Epoch 42/50, Loss: 0.0709, elapsed time: 13.29 sec Epoch 43/50, Loss: 0.0710, elapsed time: 14.25 sec Epoch 44/50, Loss: 0.0704, elapsed time: 17.97 sec Epoch 45/50, Loss: 0.0704, elapsed time: 17.74 sec Epoch 46/50, Loss: 0.0689, elapsed time: 20.39 sec Epoch 47/50, Loss: 0.0680, elapsed time: 16.52 sec Epoch 48/50, Loss: 0.0679, elapsed time: 14.49 sec Epoch 49/50, Loss: 0.0680, elapsed time: 13.22 sec Epoch 50/50, Loss: 0.0671, elapsed time: 12.91 sec Total time elapsed: 890.19 sec, 14.84 minutes
BorutaPy Selected Features¶
boruta_net, boruta_lossfun, optimizer, scheduler = create_fnn_model(input_size, output_size, layer_units, activation_functions , learningRate, gamma, step_size, L2lambda, dropout_rate, doBN)
# Note: this cell reuses train_loader/test_loader bound in the previous section;
# rebuild them from the BorutaPy-selected feature matrix before running.
boruta_net, boruta_losses, boruta_trainAcc, boruta_testAcc, boruta_yHat = function2trainTheModel(numepochs, train_loader, test_loader, boruta_net, boruta_lossfun, optimizer, scheduler, compute_accuracy_multi)
plot_training_metrics(boruta_losses, boruta_trainAcc, boruta_testAcc, 'Accuracy For BorutaPy Features')
predicted_classes = torch.argmax(boruta_yHat, dim=1).cpu().numpy()
check_accuracy(y_test, predicted_classes, 'BorutaPy Selected Features')
*** Epoch 1, Step Size: 0, Learning Rate: 0.001 *** Epoch 1/50, Loss: 0.2465, elapsed time: 12.86 sec *** Estimated total time for training: 635.34 seconds, 10.59 minutes. *** Epoch 2/50, Loss: 0.1811, elapsed time: 12.71 sec Epoch 3/50, Loss: 0.1628, elapsed time: 14.05 sec Epoch 4/50, Loss: 0.1511, elapsed time: 14.53 sec Epoch 5/50, Loss: 0.1417, elapsed time: 14.21 sec Epoch 6/50, Loss: 0.1342, elapsed time: 12.84 sec Epoch 7/50, Loss: 0.1291, elapsed time: 12.86 sec Epoch 8/50, Loss: 0.1244, elapsed time: 13.60 sec Epoch 9/50, Loss: 0.1205, elapsed time: 12.81 sec Epoch 10/50, Loss: 0.1156, elapsed time: 12.88 sec Epoch 11/50, Loss: 0.1125, elapsed time: 13.27 sec Epoch 12/50, Loss: 0.1103, elapsed time: 14.05 sec Epoch 13/50, Loss: 0.1080, elapsed time: 14.72 sec Epoch 14/50, Loss: 0.1038, elapsed time: 13.60 sec Epoch 15/50, Loss: 0.1036, elapsed time: 15.36 sec Epoch 16/50, Loss: 0.0995, elapsed time: 13.60 sec Epoch 17/50, Loss: 0.0984, elapsed time: 15.18 sec Epoch 18/50, Loss: 0.0962, elapsed time: 14.50 sec Epoch 19/50, Loss: 0.0953, elapsed time: 12.91 sec Epoch 20/50, Loss: 0.0938, elapsed time: 12.94 sec Epoch 21/50, Loss: 0.0915, elapsed time: 12.59 sec Epoch 22/50, Loss: 0.0896, elapsed time: 12.55 sec Epoch 23/50, Loss: 0.0896, elapsed time: 12.64 sec Epoch 24/50, Loss: 0.0871, elapsed time: 13.19 sec Epoch 25/50, Loss: 0.0865, elapsed time: 12.67 sec Epoch 26/50, Loss: 0.0855, elapsed time: 13.77 sec Epoch 27/50, Loss: 0.0829, elapsed time: 12.55 sec Epoch 28/50, Loss: 0.0827, elapsed time: 12.92 sec Epoch 29/50, Loss: 0.0831, elapsed time: 12.88 sec Epoch 30/50, Loss: 0.0799, elapsed time: 12.66 sec Epoch 31/50, Loss: 0.0803, elapsed time: 12.64 sec Epoch 32/50, Loss: 0.0791, elapsed time: 12.52 sec Epoch 33/50, Loss: 0.0780, elapsed time: 13.06 sec Epoch 34/50, Loss: 0.0775, elapsed time: 13.67 sec Epoch 35/50, Loss: 0.0764, elapsed time: 16.23 sec Epoch 36/50, Loss: 0.0754, elapsed time: 13.13 sec Epoch 37/50, Loss: 0.0739, elapsed time: 13.08 
sec Epoch 38/50, Loss: 0.0738, elapsed time: 12.60 sec Epoch 39/50, Loss: 0.0741, elapsed time: 12.70 sec Epoch 40/50, Loss: 0.0725, elapsed time: 12.54 sec Epoch 41/50, Loss: 0.0721, elapsed time: 12.92 sec Epoch 42/50, Loss: 0.0701, elapsed time: 12.96 sec Epoch 43/50, Loss: 0.0707, elapsed time: 14.37 sec Epoch 44/50, Loss: 0.0696, elapsed time: 12.52 sec Epoch 45/50, Loss: 0.0700, elapsed time: 12.82 sec Epoch 46/50, Loss: 0.0685, elapsed time: 13.09 sec Epoch 47/50, Loss: 0.0689, elapsed time: 12.53 sec Epoch 48/50, Loss: 0.0694, elapsed time: 12.75 sec Epoch 49/50, Loss: 0.0675, elapsed time: 12.58 sec Epoch 50/50, Loss: 0.0671, elapsed time: 13.10 sec Total time elapsed: 775.02 sec, 12.92 minutes
Summary of Network Results¶
In this experiment, two different network configurations were trained and tested:
- All Features (slightly better performance)
- BorutaPy Selected Features
Results Overview:¶
Good Accuracy: Both models demonstrated strong overall accuracy in detecting malicious URLs and performed competitively in a relatively short training time.
Benign Class:
- The models achieved around 95.7% accuracy on benign URLs, indicating robust performance with a low misclassification rate of 4.29%.
Defacement Class:
- The models reached around 99% accuracy, reflecting their high effectiveness in identifying defacement.
Malware Class:
- Both models showed around 93% accuracy, demonstrating solid performance in detecting malware.
Phishing Class:
- Accuracy for the phishing class was slightly lower, with results around 91%, indicating this class remains a challenging area that may benefit from further optimization and feature engineering.
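The per-class figures above are per-class accuracies (recalls): the diagonal of the confusion matrix divided by each row total. A sketch with illustrative counts (not the notebook's actual confusion matrix) chosen to reproduce the reported ~95/99/93/91 pattern:

```python
def per_class_accuracy(conf):
    # Recall per class: correct predictions / all samples of that class.
    return [row[i] / sum(row) for i, row in enumerate(conf)]

# Illustrative counts only.
conf = [
    [9571,  100,  129,  200],  # benign
    [  30, 3960,    5,    5],  # defacement
    [  40,   10,  930,   20],  # malware
    [ 250,   20,   30, 3100],  # phishing
]
print([round(a, 3) for a in per_class_accuracy(conf)])  # [0.957, 0.99, 0.93, 0.912]
```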
Conclusion¶
The All Features model performed slightly better than the BorutaPy Selected Features model, particularly in capturing subtle differences between classes. Both models achieved strong results across most classes with efficient training times. However, consistent with previous findings, the phishing class remains a weak point, which requires further refinement for improved accuracy. Misclassification rates were low for benign URLs, and misclassifications between malware, defacement, and phishing were relatively minor.
FNN with BERT¶
What is BERT?¶
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model designed to understand context by looking at text in both directions (left-to-right and right-to-left). It is widely used for tasks like text classification, question answering, and feature extraction due to its ability to capture deep, contextual meaning from words and phrases.
What is bert-base-uncased?¶
bert-base-uncased is a version of BERT with 12 layers and 110 million parameters, trained on lowercased English text. It treats text as case-insensitive, meaning all inputs are converted to lowercase, making it efficient for handling tasks where capitalization isn’t important (like URLs). Practically, BERT extracts 768-dimensional feature vectors for each input, allowing it to represent the nuanced semantics of the text.
Why Use bert-base-uncased for URL Detection?¶
Contextual Understanding: URLs often contain subtle patterns that can indicate malicious behavior. BERT’s ability to capture token relationships allows it to detect these nuances effectively.
No Need for Feature Engineering: Instead of manually designing features, BERT automatically extracts rich, meaningful representations from URLs, which can then be classified as phishing, malware, or other categories.
Scalable: Pre-trained models like bert-base-uncased can be fine-tuned for specific tasks (like URL detection) with less data, making it efficient and scalable for large datasets.

Bidirectional Analysis: BERT processes text in both directions, helping capture important patterns across the entire URL, whether they occur at the start or end of the string.
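What the tokenizer actually does to a URL: BERT's WordPiece tokenizer splits unseen strings into the longest vocabulary subwords it can find, so a deceptive domain like paypal-secure decomposes into recognizable pieces. A toy greedy longest-match sketch (the real bert-base-uncased vocabulary has ~30k entries; this four-entry vocab is purely illustrative):

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first subword split, the scheme behind
    # BERT's WordPiece tokenizer (toy version, no real vocab file).
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"pay", "##pal", "##-", "##secure"}
print(wordpiece_tokenize("paypal-secure", vocab))  # ['pay', '##pal', '##-', '##secure']
```

The real tokenizer also lowercases input (the "uncased" part) and wraps sequences in [CLS]/[SEP] markers.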
from transformers import BertModel, BertTokenizer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
%%time
import torch
def extract_bert_features_batch(urls):
inputs = tokenizer(urls, return_tensors='pt', padding=True, truncation=True, max_length=133)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
with torch.no_grad():
outputs = model(input_ids, attention_mask=attention_mask)
token_embeddings = outputs.last_hidden_state
sentence_embeddings = torch.mean(token_embeddings, dim=1)
return sentence_embeddings
bert_batch_size = 32  # local name so the training batch_size (1024) is not overwritten
bert_features = []
for i in range(0, len(df), bert_batch_size):
batch_urls = df.reset_index().iloc[i:i + bert_batch_size]['url'].tolist()
batch_features = extract_bert_features_batch(batch_urls)
bert_features.append(batch_features)
bert_features = torch.cat(bert_features, dim=0).numpy()
print("BERT Features Shape:", bert_features.shape)
print("Target Shape:", df['type'].shape)
BERT Features Shape: (608510, 768) Target Shape: (608510,) CPU times: total: 2d 15h 19min 41s Wall time: 11h 53min
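One caveat about the pooling above: torch.mean over dim=1 averages every position, including [PAD] tokens, so short URLs in a padded batch get diluted embeddings. A masked-mean variant is a common alternative (numpy sketch, independent of the notebook's helpers; whether it would improve results here is untested):

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    # Average only over real tokens, ignoring padding positions.
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)         # (batch, 1)
    return summed / counts

# One sequence of two real tokens plus one padding slot.
emb = np.array([[[2.0, 2.0], [4.0, 4.0], [0.0, 0.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean_pool(emb, mask))  # [[3. 3.]] -- the plain mean would give [[2. 2.]]
```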
from sklearn.model_selection import train_test_split
bert_features_df = pd.DataFrame(bert_features)
labels_df = pd.DataFrame(df['type'].values, columns=['type'])
bert_df = pd.concat([bert_features_df, labels_df], axis=1)
X = bert_df.drop(columns=['type'])
y = bert_df['type']
X_train_bert, X_test_bert, y_train_bert, y_test_bert = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
print(X_train_bert.shape, X_test_bert.shape, y_train_bert.shape, y_test_bert.shape)
(486808, 768) (121702, 768) (486808,) (121702,)
bert_ds = X_train_bert.describe().T
bert_ds
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 486808.0 | 0.186599 | 0.142357 | -0.756574 | 0.096685 | 0.187011 | 0.275938 | 0.900724 |
| 1 | 486808.0 | -0.079289 | 0.137129 | -0.801882 | -0.171787 | -0.079434 | 0.012914 | 0.775830 |
| 2 | 486808.0 | 0.317695 | 0.138945 | -0.387983 | 0.225834 | 0.309290 | 0.403508 | 1.086027 |
| 3 | 486808.0 | 0.041942 | 0.163707 | -0.714411 | -0.057832 | 0.060781 | 0.155435 | 0.736621 |
| 4 | 486808.0 | 0.292893 | 0.123600 | -0.782637 | 0.212815 | 0.291612 | 0.372815 | 0.985944 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 486808.0 | -0.152188 | 0.161910 | -0.962243 | -0.263471 | -0.173447 | -0.053416 | 0.589963 |
| 764 | 486808.0 | -0.039436 | 0.123828 | -0.688595 | -0.119005 | -0.037260 | 0.045138 | 0.534348 |
| 765 | 486808.0 | -0.075762 | 0.121400 | -0.894989 | -0.153326 | -0.070693 | 0.003862 | 0.498655 |
| 766 | 486808.0 | -0.068985 | 0.124215 | -0.673868 | -0.150809 | -0.066869 | 0.012941 | 0.798272 |
| 767 | 486808.0 | 0.137579 | 0.138483 | -0.535462 | 0.052473 | 0.138289 | 0.222824 | 0.972031 |
768 rows × 8 columns
y_train_bert = y_train_bert.apply(lambda x: encoding_map[x])
y_test_bert = y_test_bert.apply(lambda x: encoding_map[x])
X_train_bert_tensor = torch.tensor(X_train_bert.values, dtype=torch.float32)
y_train_bert_tensor = torch.tensor(y_train_bert.values, dtype=torch.long)
X_test_bert_tensor = torch.tensor(X_test_bert.values, dtype=torch.float32)
y_test_bert_tensor = torch.tensor(y_test_bert.values, dtype=torch.long)
train_dataset = TensorDataset(X_train_bert_tensor, y_train_bert_tensor)
test_dataset = TensorDataset(X_test_bert_tensor, y_test_bert_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
# One batch containing the entire test set
test_loader = DataLoader(test_dataset, batch_size=test_dataset.tensors[0].shape[0])
input_size = train_loader.dataset.tensors[0].shape[1]
output_size = len(y_train_bert.unique())  # 4 classes
step_size = 30
gamma = 0.8
dropout_rate = 0.2
net, lossfun, optimizer, scheduler = create_fnn_model(input_size, output_size, layer_units, activation_functions , learningRate, gamma, step_size, L2lambda, dropout_rate, doBN)
net, losses, trainAcc, testAcc, yHat = function2trainTheModel(numepochs, train_loader, test_loader, net, lossfun, optimizer, scheduler, compute_accuracy_multi)
plot_training_metrics(losses, trainAcc, testAcc, 'Accuracy For All Features')
predicted_classes = torch.argmax(yHat, dim=1).cpu().numpy()
check_accuracy(y_test_bert, predicted_classes, 'All Features')
*** Epoch 1, Step Size: 0, Learning Rate: 0.001 *** Epoch 1/50, Loss: 0.1427, elapsed time: 188.30 sec *** Estimated total time for training: 9468.65 seconds, 157.81 minutes. *** Epoch 2/50, Loss: 0.0958, elapsed time: 189.37 sec Epoch 3/50, Loss: 0.0825, elapsed time: 187.48 sec Epoch 4/50, Loss: 0.0749, elapsed time: 186.15 sec Epoch 5/50, Loss: 0.0695, elapsed time: 186.40 sec Epoch 6/50, Loss: 0.0654, elapsed time: 185.68 sec Epoch 7/50, Loss: 0.0621, elapsed time: 184.97 sec Epoch 8/50, Loss: 0.0592, elapsed time: 187.07 sec Epoch 9/50, Loss: 0.0571, elapsed time: 189.39 sec Epoch 10/50, Loss: 0.0546, elapsed time: 187.32 sec Epoch 11/50, Loss: 0.0534, elapsed time: 188.09 sec Epoch 12/50, Loss: 0.0511, elapsed time: 189.27 sec Epoch 13/50, Loss: 0.0493, elapsed time: 187.42 sec Epoch 14/50, Loss: 0.0481, elapsed time: 187.98 sec Epoch 15/50, Loss: 0.0470, elapsed time: 185.96 sec Epoch 16/50, Loss: 0.0452, elapsed time: 187.40 sec Epoch 17/50, Loss: 0.0442, elapsed time: 184.66 sec Epoch 18/50, Loss: 0.0435, elapsed time: 186.69 sec Epoch 19/50, Loss: 0.0427, elapsed time: 187.27 sec Epoch 20/50, Loss: 0.0416, elapsed time: 187.89 sec Epoch 21/50, Loss: 0.0404, elapsed time: 188.09 sec Epoch 22/50, Loss: 0.0398, elapsed time: 186.92 sec Epoch 23/50, Loss: 0.0390, elapsed time: 192.95 sec Epoch 24/50, Loss: 0.0379, elapsed time: 191.48 sec Epoch 25/50, Loss: 0.0373, elapsed time: 187.20 sec Epoch 26/50, Loss: 0.0365, elapsed time: 187.88 sec Epoch 27/50, Loss: 0.0362, elapsed time: 335.88 sec Epoch 28/50, Loss: 0.0353, elapsed time: 261.89 sec Epoch 29/50, Loss: 0.0347, elapsed time: 292.15 sec Epoch 30/50, Loss: 0.0342, elapsed time: 272.56 sec *** Epoch 31, Step Size: 30, Learning Rate: 0.0008 *** Epoch 31/50, Loss: 0.0317, elapsed time: 184.13 sec Epoch 32/50, Loss: 0.0301, elapsed time: 186.29 sec Epoch 33/50, Loss: 0.0300, elapsed time: 187.48 sec Epoch 34/50, Loss: 0.0295, elapsed time: 188.71 sec Epoch 35/50, Loss: 0.0290, elapsed time: 185.94 sec Epoch 
36/50, Loss: 0.0285, elapsed time: 204.32 sec Epoch 37/50, Loss: 0.0279, elapsed time: 212.09 sec Epoch 38/50, Loss: 0.0278, elapsed time: 220.62 sec Epoch 39/50, Loss: 0.0272, elapsed time: 210.81 sec Epoch 40/50, Loss: 0.0264, elapsed time: 220.84 sec Epoch 41/50, Loss: 0.0264, elapsed time: 257.80 sec Epoch 42/50, Loss: 0.0266, elapsed time: 219.34 sec Epoch 43/50, Loss: 0.0260, elapsed time: 186.69 sec Epoch 44/50, Loss: 0.0256, elapsed time: 199.13 sec Epoch 45/50, Loss: 0.0253, elapsed time: 184.95 sec Epoch 46/50, Loss: 0.0249, elapsed time: 211.25 sec Epoch 47/50, Loss: 0.0248, elapsed time: 186.31 sec Epoch 48/50, Loss: 0.0245, elapsed time: 185.64 sec Epoch 49/50, Loss: 0.0243, elapsed time: 188.40 sec Epoch 50/50, Loss: 0.0239, elapsed time: 185.36 sec Total time elapsed: 10362.19 sec, 172.70 minutes
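`check_accuracy` is the notebook's own helper; for per-class detail on a multi-class problem like this one, scikit-learn's `classification_report` is a standard complement. A minimal sketch with toy labels (the 0–3 encoding and class names are assumptions mirroring the four URL types in the project, not the notebook's actual `encoding_map`):

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy labels only -- in the notebook this would be y_test_bert vs predicted_classes
y_true = np.array([0, 0, 1, 2, 3, 3, 1, 0])
y_pred = np.array([0, 0, 1, 2, 3, 1, 1, 0])

print(classification_report(
    y_true, y_pred,
    target_names=['benign', 'defacement', 'phishing', 'malware']))
```

This prints per-class precision, recall, and F1, which is useful when one class (here, benign) matters more than overall accuracy.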
Project Summary¶
Overview¶
This project aimed to develop a highly accurate model for detecting malicious URLs using both machine learning (ML) and deep learning techniques. The workflow included data preprocessing, feature extraction, model training, and evaluation, culminating in excellent results.
Key Achievements¶
Feature Extraction:
- Key features such as URL length, special character counts, suspicious keywords, and n-gram patterns significantly contributed to classification accuracy.
Model Performance:
- Traditional models (XGBoost, LightGBM, CatBoost) performed well across feature sets, while deep learning models, especially the BERT-based FNN, achieved the highest overall accuracy of 98%. The model excelled at identifying benign URLs with 99% accuracy, which was crucial to the primary goal of distinguishing between benign and malicious URLs effectively.
Result:
- The BERT-based FNN model demonstrated the best results, achieving 98% overall accuracy in detecting malicious URLs and 99% accuracy on benign URLs.
Future Directions¶
- System Integration: Incorporate the model into a security framework, cross-referencing new URLs against a database of known safe URLs.
- Continuous Monitoring: Utilize external tools like "VirusTotal" to enhance URL safety verification.
- Ongoing Data Collection: Regularly collect new data to adapt to emerging threats, with scheduled re-training every 6–12 months to maintain high accuracy as hacking methods evolve.
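The system-integration idea above can be sketched as a simple allowlist check that runs before the model is invoked. Everything here (the allowlist contents, the `is_known_safe` helper name) is a hypothetical illustration, not part of the project's code:

```python
from urllib.parse import urlparse

# Hypothetical database of known safe hosts (in practice, a real URL database)
known_safe = {"example.com", "wikipedia.org"}

def is_known_safe(url: str) -> bool:
    """Return True if the URL's host is on the allowlist."""
    host = urlparse(url).netloc.lower()
    # Strip a leading "www." so www.example.com matches example.com
    return host.removeprefix("www.") in known_safe

print(is_known_safe("https://www.example.com/login"))  # True
print(is_known_safe("http://evil.example.net/login"))  # False
```

URLs that pass the allowlist skip the model entirely; everything else is scored, which keeps the expensive BERT pipeline off the hot path for trusted traffic.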
Conclusion¶
Deep learning, particularly the BERT-based FNN, is highly effective for malicious URL detection, providing a robust "second firewall" in cybersecurity defenses that operates independently of specific systems. The model’s strength in accurately classifying benign URLs demonstrates its value in differentiating malicious from safe URLs, fulfilling the project’s core objective.